Modeling biological systems

ABSTRACT

Biological systems are modeled using formal languages, theorem provers, model checkers, and term rewriting systems. The models include rules that express a substitution of at least one symbol by at least another symbol. The symbols represent elements of the biological system, and the rules are expressed in a manner that, for example, enables an inference engine to infer alternative results from the system based on an initial hypothetical state.
Inference engines are also applied to symbolically simulate the biological system, test its properties, and explore it. Abstractions and algorithms can be employed to enable symbolic calculation of state sets for the biological system.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. ______ filed on May 10, 2001, entitled “Modeling Biological Systems”, naming Patrick D. Lincoln and Keith R. Laderoute as inventors, the contents of which are incorporated herein in their entirety by reference.

BACKGROUND

[0002] This invention relates to modeling biological systems.

[0003] In many respects, cells are living information processors that respond and adapt to their environment. They can sense multiple parameters, integrate signals, and regulate responses. In multicellular organisms, cells process sensory stimuli from surrounding and even distant cells to orchestrate ornate patterns of differentiation.

[0004] The processing of signals in cells is typically mediated by polypeptides and nucleic acids. Human cells, for example, are estimated to have approximately 30,000 genes (International Human Genome Sequencing Consortium (2001) Nature 409:860 and Venter et al. (2001) Science 291:1304). Each cell can express a subset of these genes. Expressed genes are typically translated to produce polypeptides with particular functional properties. Many polypeptides effect cellular events by interacting with other components, either by physically associating with or modifying other compounds. Polypeptides can function together as assemblies, for example in signaling and in providing metabolic pathways. Notably, the functions of polypeptides can be highly regulated, e.g., by other polypeptides. The relationships among polypeptides, nucleic acids, and other cellular components endow cells with a complex network of molecular elements that sense and propagate signals to control cellular behavior.

SUMMARY

[0005] The invention is based, in part, on the discovery that rewriting logic, formal languages, and formal language tools can be used to model interactions in biological systems.

[0006] In one aspect, the invention features a method that includes generating a model that includes rules for a biological system. Each of the rules expresses a substitution of at least one symbol by at least another symbol, the symbols representing biological elements. At least some of the rules are expressed in a manner that enables an inference engine to infer alternative results from the system based on an initial hypothetical state.

[0007] Implementations of the invention may include one or more of the following features. One or more of the rules may include an operator that expresses a relationship between at least two of the biological elements. The operator may conform to one or more properties selected from the group consisting of associativity, commutativity, idempotence, and identity. For example, the operator may be associative and commutative or associative, commutative, and idempotent. One or more of the rules may express concurrent state transitions. One or more of the rules may be non-terminating. Further, one or more of the rules may be conditional. In certain embodiments, one or more of the rules represents a feedback or feedforward interaction between biological elements. One or more of the rules may be reflective.

[0008] At least some of the symbols representing the biological elements are typed. The types may be organized in hierarchical classes.

[0009] The method may also include expressing the rules or the system graphically by representing at least some of the symbols as points, and at least some of the rules as lines interconnecting points, each interconnected point corresponding to a symbol that is an operand of the rule. The resulting image is a wiring diagram.

[0010] In addition, the method may include processing the rules with an inference engine. The inference engine symbolically simulates the biological reactions of interest. The inference engine can determine all possible precursor states for a particular state of interest, given a set of reactions expressed as rules. The inference engine can determine if the rules are terminating and/or Church-Rosser, or can identify a feedback or feedforward interaction. The inference engine can process the rules using associative-commutative matching.

[0011] The method can further include representing biological states as vectors of logical properties. The method can further include processing the rules using a model checker. It can further include displaying a representation of a decision diagram. It can further include displaying a wiring diagram, e.g., as a hypergraph, and, optionally, computing the transitive closure of that hypergraph.

[0012] The method can further include generating an algebraic abstraction of components of the biological system. These algebraic expressions can include compact polynomial expressions of all relevant biological molecules being studied. The symbols may represent molecules in the biological system. For example, one or more symbols may represent an element that is selected from the group consisting of a polypeptide, a nucleic acid, a metabolite, a lipid, and a small molecule. In one embodiment, at least one symbol represents a polypeptide selected from the group consisting of a protein kinase, a nucleotide-binding protein, a transcription factor, a phosphatase, and a protease. In another embodiment, at least one symbol represents a drug, toxin, non-self antigen, or other exogenous agent. In still another embodiment, at least one symbol represents an antibody, a hormone, or a cytokine. In another embodiment, at least one symbol represents a gene or portion of a genetic network.

[0013] At least some of the symbols may represent a post-translational modification, e.g., phosphorylation, acetylation, ubiquitination, proteolysis, or methylation.

[0014] The model of the biological system may include symbols representing molecules in a first cell and other symbols representing molecules in a second cell. In another embodiment, the model includes symbols that represent each individual molecule of a network or system present in a cell. In another embodiment, each organelle of a cell may be represented as a separate collection (e.g., a multiset of symbols under an associative and commutative operator), and all membrane transport and other interactions represented as a set of rules. In another embodiment, each organelle and each membrane of the cell may be represented as a separate collection. In another embodiment, each scaffold protein with its associated proteins and protein complexes may be represented as a separate collection (e.g., a set of symbols under an associative, commutative, and idempotent operator).

[0015] In another example, the method of generating rules includes parsing a protein-protein interaction map into rules.

[0016] In another aspect, the invention features a method of processing an initial state. The method includes receiving in an inference engine a set of symbols that represent a hypothetical initial state of a biological system, and processing the initial state using rules to infer alternative resultant states of the system. Each of the rules expresses a substitution of at least one symbol by at least another symbol. The symbols represent elements of the biological system. In one embodiment, the inference engine determines all possible alternative resultant states of the system from a given initial state.

[0017] The inference engine may be configured to detect infinite substitution chains or feedforward and feedback interactions.

[0018] The method may also include comparing the resultant states to a set of symbols representing a hypothetical final state of the system to determine if the system may transition from the hypothetical initial state to the hypothetical final state.

[0019] The method may also include determining all possible states which lead to a given final state. The method may include determining all possible states which lead to any state satisfying a given predicate.

[0020] The method may further include parsing a profile (e.g., a gene expression profile or a polypeptide profile) into symbols, and including at least some (e.g., one or more) of the symbols in the set of symbols representing the hypothetical initial state of the system. An expression profile includes information about the expression level of genes in a sample or a cell. Similarly, a polypeptide profile includes information about the abundance and/or modification state of polypeptides in a sample or a cell. The profile may be obtained from a biological sample, e.g., a sample associated with, having a predisposition for, or suspected of having a disease or disorder. Examples of diseases and disorders include cancer, diabetes, infection (e.g., by a pathogen), inflammation, and experimentally induced conditions on the cell. The sample can be obtained from a patient or model organism.

[0021] The hypothetical initial state may include information about a genetic alteration or mutation (e.g., a substitution, insertion, deletion, translocation, or trinucleotide repeat expansion). The method may include providing additional properties (e.g., symbols for a drug or other exogenous agent) to the hypothetical initial state, and comparing the alternative resultant states to a reference state, e.g., the state of a normal cell or to the alternative resultant states identified in the absence of the additional property.

[0022] In another aspect, the invention features a method of processing an initial state. The method includes receiving a set of symbols in an inference engine, the set representing a hypothetical initial state of a biological system, the symbols representing biological elements of the system; and iteratively substituting symbols representing biological elements by other symbols representing biological elements using rules that represent interactions between the biological elements until a terminal state or until alternative resultant states are detected.

[0023] The method may also include outputting the terminal state or at least one of the alternative resultant states, e.g., as a graphical display. The method may also include outputting large sets of possible terminal states as a set of alternatives, e.g., as a tree or graph representing those possible terminal states. The method may include navigation of such possible terminal states with logical operators.

[0024] In still another aspect, the invention features a method of evaluating a rule describing an interaction between elements of a biological system. The method includes: receiving in an inference engine (1) at least first and second sets of symbols, representing hypothetical first and second states of a biological system, the symbols representing biological elements of the biological system and (2) rules that express a substitution of symbols representing biological elements by other symbols representing biological elements; and either determining if one or more of the rules must be true or false for the first state to reach the second state by processing the first state using the rules or determining if the first state can reach or progress to the second state given the rules. For example, the method uses a theorem prover to prove or disprove a theory, e.g., a theory implicit in a contemplated rule. The proof may depend on the interrelationship of the first state with the second state. The hypothetical first and second states may represent different cells, e.g., a normal cell or a cell with a precondition, disease or disorder. If a rule must be true for the biological system to progress from the first state to the second state, the operands of the rule may identify elements of the biological system that are drug targets. For example, preventing the element represented by one of the operands from functioning is likely to prevent the biological system from progressing from the first (e.g., normal) state to the second state (e.g., a diseased state).

[0025] In yet another aspect, the invention features a method of modeling a biological system. The method includes identifying gene expression profiles for a first and second sample, each gene expression profile representing the state of a biological system; and generating the first and second sets of symbols from the profiles for each state for input into the inference engine. The first and second samples may have one or more genetic alterations with respect to one another.

[0026] The invention also features an article that includes a machine-readable medium having encoded thereon a model of a biological system, the model comprising rules that express a substitution of symbols representing biological elements by other symbols representing biological elements. At least some of the rules are expressed in a manner that enables an inference engine to infer alternative results from the system based on an initial hypothetical state.

[0027] Also featured is an apparatus that includes a processor and software configured to cause the processor to: receive a set of symbols, the set representing a hypothetical initial state of a biological system, the symbols representing biological elements of the system; and iteratively substitute symbols representing biological elements by other symbols representing biological elements using rules that represent interactions between the biological elements until a terminal state or until alternative resultant states are detected.

[0028] Other features and advantages of the invention will be apparent from the description and the claims.

DESCRIPTION OF DRAWINGS

[0029] FIG. 1 is a flow chart of an exemplary process for generating and using rules to model biological systems.

[0030] FIG. 2 is a block diagram of the Maude Interpreter's Rewriting Engine.

[0031] FIGS. 3A and 3B are wiring diagrams for pRb regulatory pathways.

DETAILED DESCRIPTION

[0032] The methods described here generally use the semantic and logical framework of formal methods to model the circuitry of biological systems. Referring to the example shown in FIG. 1, information about a biological system is extracted 110, e.g., from direct observations, experiments, and scientific literature (e.g., PubMed, http://www.ncbi.nlm.nih.gov/entrez/). Elements in the biological system are identified and represented 120 as symbols. Symbols can be related to one another using hierarchical data types. The information is then parsed to formulate 130 rules about the system. The symbols and rules can be used in a variety of methods. Non-limiting examples include evaluating a hypothetical initial state for the biological system 140 and 142; testing a theorem 150 and 152; and checking a model.

[0033] To evaluate an initial state, information about a hypothetical initial state is provided 140 (e.g., in symbolic form or as raw data that is processed to symbolic form). An inference engine then evaluates 142 the hypothetical initial state by iteratively processing the rules. One or more terminal states or alternative resultant states are identified and outputted 144. The output can be rendered 180 in a format convenient for a user. In addition, the output can be used for subsequent analysis 145, e.g., as a second hypothetical initial state or for a comparison to one or more reference states.

[0034] To test a theorem 152, information about known states of the biological system is identified and parsed into symbolic form 150. A theorem is then evaluated 152 by a theorem prover supplied with rules and the known states of the biological system. If a proof is possible, the theorem prover generates an output 180 as to whether the theorem is true or false given the supplied information. A proven theorem can be used as a rule in subsequent analysis 154.

Rewriting Logic

[0035] Rewriting logic is a form of logical computation that has performance features well suited for the specification and analysis of complex systems. In its simplest embodiment, a rewriting rule is a substitution of symbols. For example, the expression t→t′ is a rewriting that expresses a local state transition in which a portion of a system's state that is represented by t (the arity) is changed to a new state represented by t′ (the coarity). The rewriting operation can be implemented in a computer system to replace instances of the symbol t with t′ in a memory store. Each operator can have one or more of the properties of associativity, commutativity, idempotence, and identity among others. For example, in representing a protein complex of two proteins A and B, we may use the operator “:”, thus forming (A:B).

[0036] The operator “:” can be commutative. For example,

[0037] (A:B) is equivalent to (B:A)

[0038] The operator “:” can be associative. For example,

[0039] (A:B):C is equivalent to A:(B:C).

[0040] The operator “:” can be idempotent. For example,

[0041] (A:A) is equivalent to A.

[0042] The operator “:” can have an identity. For example,

[0043] (A:Id) is equivalent to A.
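The four operator properties listed above can be illustrated with a minimal Python sketch. It is not taken from Maude or PVS; the nested-tuple encoding of terms, the function names flatten and canonical, and the constant IDENTITY are assumptions made for illustration only. Two terms are treated as equivalent when they have the same canonical form.

```python
from collections import Counter

IDENTITY = "Id"  # hypothetical name for the identity element of ':'

def flatten(term):
    """Flatten nested tuples, e.g. (("A", "B"), "C") -> ["A", "B", "C"].

    Ignoring the nesting is what makes the operator associative.
    """
    if isinstance(term, tuple):
        symbols = []
        for sub in term:
            symbols.extend(flatten(sub))
        return symbols
    return [term]

def canonical(term, idempotent=False):
    """Return a canonical form of a ':'-term (illustrative encoding).

    Associativity: nesting is flattened.
    Identity:      occurrences of IDENTITY are dropped.
    Commutativity: operand order is ignored (multiset of symbols).
    Idempotence:   optionally, duplicates collapse (set of symbols).
    """
    symbols = [s for s in flatten(term) if s != IDENTITY]
    if idempotent:
        return frozenset(symbols)
    return frozenset(Counter(symbols).items())
```

Under these assumptions, canonical(("A", "B")) equals canonical(("B", "A")), and (A:A) collapses to A only when idempotent=True is requested, which mirrors the distinction between a multiset and a set of symbols.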

[0044] Multiple operators can be used to express different relationships between components of a biological system. For example, a first commutative operator “:” can be used to represent protein complexes, and a second associative-commutative operator “;” can be used to represent the cytoplasm. A third associative-commutative operator “|” can be used to represent the membrane of an organelle.

[0045] A rewriting change can occur independently from any other non-overlapping state change. Hence, the rewriting rules can evaluate concurrent state changes, e.g., for highly nondeterministic concurrent computations. When applied to a hypothetical state of a system, a set of “terminating” rules can arrive at a solution for which no further state transitions can be applied. For example, the set of rules:

a:b → b:a, b:a → c

[0046] is terminating. When the input state is “a:b,” the rewriting rules reach the solution, “c”. In contrast, “non-terminating” rules do not reach such a solution. For example, the set of rules:

a:b → b:a, b:a → a:b

[0047] is non-terminating. When the input state is “a:b,” the rewriting rules do not reach a solution for evaluating the system. Non-terminating rules can cause “infinite substitution chains.” The inference engine can be configured to detect such infinite substitution chains. Further, the two non-terminating rules above effectively express commutativity. An inference engine may detect rules expressing commutativity of an operator, and replace them with the explicit notation that the operator is commutative. This requires the inference engine to have built-in treatment of commutativity, e.g., including commutative matching. Similarly, the inference engine can detect rules expressing associativity. Such features of the inference engine are termed associative-commutative matching.
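The difference between the terminating and non-terminating rule sets above can be sketched in Python as follows. This is an illustrative simplification, not the method of any particular inference engine: states are plain strings, rules are applied by substring replacement rather than by associative-commutative matching, and the function name rewrite_until_stable and the max_steps bound are assumptions. A repeated state is taken as evidence of an infinite substitution chain.

```python
def rewrite_until_stable(state, rules, max_steps=1000):
    """Repeatedly apply the first applicable rule to `state`.

    Returns ("terminal", s) when no rule applies to s,
            ("cycle", s)    when state s has been seen before, i.e. an
                            infinite substitution chain was detected,
            ("limit", s)    when max_steps is exhausted (assumed bound).
    """
    seen = {state}
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in state:
                # Replace the first occurrence of the left-hand side.
                state = state.replace(lhs, rhs, 1)
                break
        else:
            # No rule matched: the state is terminal.
            return ("terminal", state)
        if state in seen:
            return ("cycle", state)
        seen.add(state)
    return ("limit", state)

# The two rule sets discussed in the text, written as (lhs, rhs) pairs.
terminating = [("a:b", "b:a"), ("b:a", "c")]
non_terminating = [("a:b", "b:a"), ("b:a", "a:b")]
```

Starting from the state a:b, the first rule set ends in the terminal state c, while the second returns to a:b and is reported as a cycle. Detecting a repeated state catches only chains that revisit a state; chains that grow without bound would instead run into the step limit.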

[0048] An inference engine for processing rewriting rules can implement model checking algorithms. Model checking allows complex properties to be checked over all possible computation paths with relative efficiency (J. R. Burch, et al. (1990) “Sequential Circuit Verification Using Symbolic Model Checking” Proc. 27th DAC 1990; pp. 45-51).

[0049] In some embodiments, the inference engine produces one possible output state as a forward symbolic simulation of the biological system.

[0050] In some other embodiments, the inference engine for processing rewriting rules searches for multiple resulting states, or all possible resulting states from a given initial state, e.g., an exhaustive simulation of all possible nondeterministic steps of rewriting. For example, at each step where more than one rewriting rule may be applicable, the engine records the possible set of choices. The computational inference engine moves forward along one branch of possible system evolution. When a final state is reached, it is recorded or output, e.g., in a compact fashion, and then the inference engine backtracks to the last choice made among possible branches. The inference engine then notes that the fully explored branch is done, and proceeds along another branch. In this way, the inference engine can systematically explore all possible outcomes. Efficient implementation of this search for all possible final states is accomplished through the use of efficient indexing and state-storage techniques.
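The branch-and-backtrack search described above can be sketched as a depth-first exploration over multisets of symbols. The sketch is illustrative only: the state encoding (a sorted tuple of symbols), the function names successors and final_states, and the example symbols L, R, I, LR, LI, and Ract are hypothetical, and a simple visited set stands in for the indexing and state-storage techniques mentioned in the text.

```python
from collections import Counter

def successors(state, rules):
    """Yield every state reachable from `state` in one rewriting step.

    `state` is a tuple of symbols treated as a multiset; each rule is a
    pair (lhs, rhs) of symbol tuples.
    """
    bag = Counter(state)
    for lhs, rhs in rules:
        need = Counter(lhs)
        if all(bag[sym] >= n for sym, n in need.items()):
            nxt = bag - need + Counter(rhs)
            yield tuple(sorted(nxt.elements()))

def final_states(initial, rules):
    """Explore all branches depth-first and collect the terminal states."""
    start = tuple(sorted(initial))
    visited = {start}   # states already recorded, so no branch is redone
    finals = set()
    stack = [start]     # pending choices; popping one is the backtrack
    while stack:
        state = stack.pop()
        nexts = list(successors(state, rules))
        if not nexts:
            finals.add(state)  # no rule applies: a final state
        for nxt in nexts:
            if nxt not in visited:
                visited.add(nxt)
                stack.append(nxt)
    return finals

# Hypothetical rules: a ligand L binds receptor R or inhibitor I, and
# the ligand-receptor complex LR yields an activated receptor Ract.
example_rules = [
    (("L", "R"), ("LR",)),
    (("L", "I"), ("LI",)),
    (("LR",), ("Ract",)),
]
```

For the initial state L, R, I, this hypothetical rule set has two alternative outcomes: either the receptor is activated and the inhibitor remains free, or the ligand is sequestered by the inhibitor and the receptor remains unbound.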

[0051] Rewriting logic is “reflective,” and thus can be used to make and evaluate assertions about itself. This capability is also described as “metalogic” or “metaprogramming.” Thus, a rewriting logic computational environment, such as Maude (see below), can be used to formulate metalogic statements that test theories and/or analyze the properties of rules and systems.

[0052] Formal methods can be used to query a set of rules to determine if a biological interaction is possible, e.g., to determine if a test rule is consistent with a set of predefined rules or to predict the outcome of adding a test rule to a set of predefined rules. The system can be used to determine if a particular state of a system is possible or reachable given a set of predefined rules. For example, the inference engine can compute if the rules allow the state transitions necessary for an initial state to reach a particular final state. In another example, the inference engine is used to determine if the particular final state is included among the set of alternative resultant states.
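One way to picture the reachability query, and the effect of adding a test rule to a set of predefined rules, is the following Python sketch. It is a simplified illustration under stated assumptions: states are multisets of symbols encoded as sorted tuples, the search is a bounded breadth-first search, and the names step, reachable, max_states, and the symbols K, S, and S-P (a kinase, a substrate, and the phosphorylated substrate) are hypothetical.

```python
from collections import Counter, deque

def step(state, rules):
    """Yield each state obtained from `state` by applying one rule."""
    bag = Counter(state)
    for lhs, rhs in rules:
        need = Counter(lhs)
        if all(bag[sym] >= n for sym, n in need.items()):
            yield tuple(sorted((bag - need + Counter(rhs)).elements()))

def reachable(initial, goal, rules, max_states=10_000):
    """Breadth-first test of whether `goal` can be reached from `initial`.

    A result of False means the goal was not found within the explored
    states (at most roughly max_states, an assumed bound).
    """
    start, target = tuple(sorted(initial)), tuple(sorted(goal))
    seen, queue = {start}, deque([start])
    while queue and len(seen) <= max_states:
        state = queue.popleft()
        if state == target:
            return True
        for nxt in step(state, rules):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Predefined rule (hypothetical): kinase K phosphorylates substrate S.
predefined = [(("K", "S"), ("K", "S-P"))]
# Test rule (hypothetical): the phosphorylated substrate reverts to S.
test_rule = (("S-P",), ("S",))
```

With only the predefined rule, the state K, S-P cannot return to K, S; once the test rule is added, that transition becomes possible. Comparing the two answers is a small instance of predicting the outcome of adding a test rule.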

[0053] For example, rewriting logic can be used to determine if a set of rules is terminating and Church-Rosser, e.g., a property of a system that can be reduced to a unique normal form. For example, the Church-Rosser checker described in Duran and Meseguer ((July 2000) “A Church-Rosser Checker Tool for Maude Equational Specifications” SRI International and Universidad de Málaga) can be used to evaluate a set of rules for a biological system.

[0054] A rewriting logic interpreter can include syntactic support for object-oriented computation. Symbols can be assigned to object classes, e.g., by declaration or by an operator. Hence, membership axioms can be evaluated, e.g., to determine if a symbol is a member of a particular group. These features and others (such as the use of sorts, subsorts, and operator overloading) enable membership equational logic to be used to analyze biological systems.

[0055] Membership equational logic is an equational logic extended with a strong sort (or type) system that allows predicate definitions of sorts, and an explicit predicate enabling testing of membership in a sort. For example, a sort of protein can be defined and a predicate of “kinase?” can be defined. Using membership equational logic, the sort of kinases can be defined as the components of the sort protein that satisfy the “kinase?” predicate. Variables and operators can then be declared to operate on kinases, e.g., activation by phosphorylation and so forth. Other useful examples of these predicate sorts include the set of nonzero numbers (and then division can be declared to be defined only over the sorts of numbers divided by nonzero numbers), and protein complexes satisfying certain interesting properties (and then reactions may be enabled only for protein complexes of that particular sort).
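The idea of a sort defined by a predicate can be approximated in Python, although Python has no membership equational logic and the check below happens at run time rather than in a sort system. The class Protein, its domains field, and the functions is_kinase and phosphorylate are hypothetical names chosen for illustration; the protein names are used only as example data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protein:
    """A symbol of the sort 'protein' (illustrative encoding)."""
    name: str
    domains: frozenset = frozenset()   # assumed annotation, e.g. {"kinase"}
    phosphorylated: bool = False

def is_kinase(p):
    """The 'kinase?' predicate: membership test for the sort of kinases."""
    return isinstance(p, Protein) and "kinase" in p.domains

def phosphorylate(enzyme, substrate):
    """An operator defined only when `enzyme` is in the sort of kinases."""
    if not is_kinase(enzyme):
        raise TypeError(f"{enzyme!r} is not in the sort of kinases")
    return Protein(substrate.name, substrate.domains, phosphorylated=True)

# Example data (assumed annotations).
raf = Protein("Raf", frozenset({"kinase"}))
mek = Protein("MEK", frozenset({"kinase"}))
prb = Protein("pRb")
```

Here phosphorylate(raf, mek) yields a phosphorylated MEK symbol, whereas phosphorylate(prb, mek) is rejected because pRb does not satisfy the predicate. This plays the same role as declaring division only over nonzero divisors.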

[0056] Two examples of computer environments that support rewriting logic are (1) the PVS specification language and theorem prover, and (2) the Maude rewriting engine. Both are available from the SRI Computer Science Laboratory (SRI International, Palo Alto, Calif.; http://www.csl.sri.com/). Other examples of computer languages useful for formal methods include Café (Futatsugi and Sawada (1994) “Café as an Extensible Specification Environment” In Proc. Kunming International CASE Symposium) and ELAN (Borovansky et al. (1996) “Controlling Rewriting by Rewriting” in Proc. First International Workshop on Rewriting Logic, Electronic Notes in Theoretical Computer Science. Vol. 4 Elsevier). For a review of rewriting logic see, e.g., Meseguer (1998) “Research Directions in Rewriting Logic” In Computational Logic, ed., Berger and Schwichtenberg, Springer-Verlag; Meseguer (1996) Proc. First International Workshop on Rewriting Logic, Electronic Notes in Theoretical Computer Science. Vol. 4 Elsevier; and Kirchner and Kirchner (1998) Proc. Second International Workshop on Rewriting Logic, Electronic Notes in Theoretical Computer Science. Vol. 15 Elsevier.

Maude

[0057] Maude is a computer-based language that efficiently supports rewriting logic computation, equational computation, and algebraic specification. Some algebraic features of Maude are implemented in the OBJ style (Goguen et al. (2000) “Introducing OBJ” In Software Engineering with OBJ: Algebraic Specification in Action, pp. 3-167, Kluwer). Maude rewriting logic uses algebraic determination to identify all possible configurations of a system, e.g., a system subject to concurrent changes. With standard hardware, such as a Pentium II processor, the raw rewriting speed of Maude is over 10 million rewrites per second for simple rule sets.

[0058] Referring to FIG. 2, the Maude system includes features built around a Maude interpreter 200. The interpreter 200 is implemented in C++ and has two principal components: the rewriting engine 220 and the mixfix front end 210.

[0059] The rewriting engine is modular and includes two key components: the core module 230 and the theory interface module 240. The core module 230 includes classes for objects, not specific to an equational theory, such as equations, rules, sorts, and connected sort components. “Sorts” and “subsorts” are datatypes that can be hierarchically related. The core module can implement substitutions of symbols, e.g., as specified by rewriting rules. The theory interface module 240 supports equational theory and includes abstract base classes, symbols, and matching automata. The engine is flexibly designed so that new symbols for special rewriting semantics can be added. The theory interface can be used for free theory, associative-commutative (AC) theory, and so forth.

[0060] The mixfix front end 210 of the Maude system includes a bison/flex parser for syntax analysis, a grammar generator, a parser, a pretty printer, and a debugger.

Prototype Verification System (PVS)

[0061] PVS is an example of a computing environment that provides mechanized support for formal specification and verification. PVS consists of a specification language, a number of predefined theories, a theorem prover, and various utilities (see, e.g., Shankar (1996) Formal Methods in Computer Aided Design (FMCAD'96, Palo Alto, Calif.), Lecture Notes in Computer Science 1166, pp. 257-264, Springer).

[0062] PVS Language. The specification language of PVS is based on classical, typed higher-order logic. The base types include uninterpreted types, which may be introduced by the user, and built-in types, such as the Booleans, integers, reals, and ordinals; the type-constructors include functions, sets, tuples, records, enumerations, and recursively-defined abstract data types, such as lists and binary trees. Predicate subtypes and dependent types can be used to introduce constraints, such as the type of prime numbers. These constrained types may incur proof obligations during typechecking, but greatly increase the expressiveness and naturalness of specifications. In practice, most of the obligations are discharged automatically by the theorem prover. PVS specifications are organized into parameterized theories that may contain assumptions, definitions, axioms, and theorems. Definitions are guaranteed to provide conservative extension; to ensure this, recursive function definitions generate proof obligations. Inductively-defined relations are also supported. PVS expressions provide the usual arithmetic and logical operators, function application, lambda abstraction, and quantifiers, within a natural syntax. Names may be freely overloaded, including those of the built-in operators such as AND and +. Tabular specifications of the kind advocated by Parnas are supported, with automated checks for disjointness and coverage of conditions. An extensive prelude of built-in theories provides hundreds of useful definitions and lemmas; user-contributed libraries provide many more.

[0063] PVS Theorem Prover. The PVS theorem prover provides a collection of powerful primitive inference procedures that are applied interactively under user guidance within a sequent calculus framework. The primitive inferences include propositional and quantifier rules, induction, rewriting, and decision procedures for linear arithmetic. The implementations of these primitive inferences are optimized for large proofs: for example, propositional simplification uses binary decision diagrams (BDDs), and auto-rewrites are cached for efficiency. User-defined procedures can combine these primitive inferences to yield higher-level proof strategies. Proofs yield scripts that can be edited, attached to additional formulas, and rerun. This allows many similar theorems to be proved efficiently, permits proofs to be adjusted economically to follow changes in requirements or design, and encourages the development of readable proofs. PVS includes a BDD-based decision procedure for the relational mu-calculus and thereby provides an experimental integration between theorem proving and model checking.

[0064] PVS Interface. PVS uses GNU Emacs or XEmacs to provide an integrated interface to its specification language and prover. Commands can be selected either by pull-down menus or by extended Emacs commands. Extensive help, status-reporting and browsing tools are available, as well as the ability to generate typeset specifications (in user-defined notation) using LaTeX. Proof trees and theory hierarchies can be displayed graphically using Tcl/Tk.

[0065] Applications of PVS. PVS can be used for the formalization of requirements and design-level specifications, and for the analysis of intricate and difficult problems, e.g., for testing algorithms and architectures for fault-tolerant flight control systems, hardware systems, and real-time system design (see, e.g., Srivas et al. (1998) Chapter 4 of Formal Hardware Verification: Lecture Notes in Computer Science, T. Kropf (ed.), Vol 1287, pp. 156-205, Springer Verlag). The PVS language and environment can be used to test specifications of biological networks, and even to design biological circuits, e.g., in combination with genetic engineering techniques.

Symbolic Representation of Elements of Biological Systems

[0066] Models of biological systems can include symbols representing a diverse variety of elements. Some key elements are macromolecular polymers such as polypeptides and nucleic acids. Ribonucleic acids can include messenger RNAs (mRNAs), introns, tRNAs, viral RNA genomes, catalytic RNAs (e.g., ribosomal RNAs, snRNAs, and artificial ribozymes), and exogenously supplied RNAs (e.g., double stranded RNAs). Deoxyribonucleic acids can include genomic nucleic acids such as chromosomal, mitochondrial, and chloroplast nucleic acids, and extrachromosomal nucleic acids such as episomes (e.g., plasmids and other exogenously supplied nucleic acids). Elements can also refer to the different nucleic acid sequences such as coding regions, regulatory regions (e.g., promoters, enhancers, 5′ untranslated regions, 3′ untranslated regions, and internal ribosome entry sites), insulators, telomeres, centromeres, satellite repeats and so forth.

[0067] Each nucleic acid and polypeptide can also be modified by information about its sequence, e.g., whether the element is wild-type or a mutant or whether the element is associated with a biallelic marker or single nucleotide polymorphism (SNP).

[0068] Other elements can include small molecules such as metabolites, cofactors, drugs, toxins, and hormones. Small molecules are typically compounds with a molecular weight of less than 5,000 Daltons. Each molecule may also have a variety of states, e.g., oxidized or reduced. Small molecules can also include ions (e.g., calcium, sodium, potassium, chloride, and acetate), cofactors, metabolites, second messengers (e.g., cyclic AMP and phosphoinositides), and exogenous agents such as drugs, therapeutic polypeptides (e.g., therapeutic antibodies), peptide nucleic acids (PNA), and toxins (e.g., carcinogens).

[0069] Also represented are complexes of components (e.g., the product of binding of one protein to another, of a chemical (e.g., a drug, a substrate, or an allosteric regulator) to a protein, and of a nucleic acid to a protein), modification states of components (e.g., proteins, RNA, and DNA), conformations of components (e.g., open/closed state of an ion channel, or active/inactive state of an enzyme), and the oligomerization states of components (e.g., polymerization state of actin and microtubules).

[0070] A predicate is a function taking zero or more arguments and producing either true or false. One example is Aristotle's use of Mortal? as a predicate when reasoning that “all men are mortal, Socrates is a man, so Socrates is mortal.” Some predicates useful for symbolic representations of biological systems include individual properties of molecules (such as the ubiquitination state of a protein or the methylation state of a gene promoter), properties of protein complexes, and properties of entire organelles, cells, tissues, and organisms.

[0071] Components can also be compartmentalized to different physical regions of the system. For example, symbols can refer to elements from particular subcellular organelles, such as the endoplasmic reticulum, Golgi, lysosome, mitochondria, plasma membrane, chloroplast, and nucleus. Some symbols describe extracellular components, e.g., components in the extracellular matrix, interstitial fluid, blood, serum, lymph, or on adjacent cells. In systems that model a plurality of cells, symbols can also be associated with a particular cell of the plurality.

[0072] Components can be compartmentalized or localized by association with another component. For example, some polypeptides are bound to polypeptide scaffolds, an actin cytoskeletal element, a microtubule cytoskeletal element, a transmembrane receptor, or a molecular machine such as a proteasome. See also “Protein-Protein Interaction Maps,” below. These relationships can be represented symbolically.

[0073] Symbols can also characterize the physical properties of a biological system, e.g., hydrostatic pressure, osmotic pressure, and other gradients. For example, pressure can be described as high, normal, or low. Additional physical parameters include ion concentration, membrane polarization (e.g., charge), membrane permeability, membrane fluidity, membrane/vesicle trafficking, pH (e.g., vesicle pH), oxidative environment, and temperature (e.g., heat shock, cold shock, or normal). pH can be described as acidic, basic, or neutral.

[0074] Modules can be created that define related groups of symbols. A module can optionally include data typing, so as to hierarchically relate components of the systems. The definitions allow the symbol parser to read, write, and process symbols. Once generated, modules can be reused for implementations of different biological systems. Custom modules can be made for particular groups of elements that may be new or unique to a system of interest. Typically, the symbols are typographical strings of characters that are easily recognizable acronyms of the names of the elements that they represent. Thus, the input and output from the models are easily understood by users as well as by the language interpreter, e.g., Maude's context-free parser. However, less convenient symbols can also be used. For example, if the user does not have to manipulate the symbols, but uses a graphical interface for interacting with the computer system, the typographical rendering of the symbols is of little import.

[0075] Symbols are used in statements about the systems. Examples of statements include rules (particularly rewriting rules), axioms, and equations.

Rules

[0076] Rules can be generated from direct experimental observation, inference, analysis of scientific literature, available diagrams (e.g., wiring diagrams and interaction maps) of biological systems, and curated and/or annotated knowledge-bases. Knowledge of these interactions can be curated and used to render detailed representations of the molecular networks in cells. Such efforts are providing a holistic understanding of molecular interactions (see, e.g., Karp et al. (2000) Nucl Acids Res 28:56-59 and Weng et al. (1999) Science 284:92-96). A “wiring diagram” or “interaction map” for a biological system can be assembled from analysis of the scientific literature and/or collation of results from experiments (Kohn (1999) Mol. Biol. of the Cell 10:2703-2734 and Example, below). For example, the experiments can be the result of genetic analysis, e.g., phenotypes and genotypes, and biochemical experiments, e.g., physical interactions and reaction assays.

[0077] Generally, rules can be formulated based on the net effect of an interaction. Some examples are listed in Table 1 below.

TABLE 1

Molecular Relationship                        Rewriting Rule
protein A causes synthesis of protein B       A → A B
protein A destroys protein B                  A B → A
protein A activates protein B                 A B → A B: active
protein A inactivates protein B               A B: active → A B
protein A and C activate protein B            A C B → A C B: active
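As a non-limiting illustration, the first four rules of Table 1 can be sketched in Python by treating a state as a multiset of symbols. The function name apply_rule, the dictionary RULES, and the spelling "B:active" for the activated form of protein B are assumptions made for this sketch; they are not the syntax of Maude or of any other rewriting engine.

```python
from collections import Counter

# Each rule is (left-hand side, right-hand side); both are multisets of symbols.
# The symbol "B:active" is an illustrative spelling for activated protein B.
RULES = {
    "synthesis":    (Counter(["A"]), Counter(["A", "B"])),
    "destruction":  (Counter(["A", "B"]), Counter(["A"])),
    "activation":   (Counter(["A", "B"]), Counter(["A", "B:active"])),
    "inactivation": (Counter(["A", "B:active"]), Counter(["A", "B"])),
}


def apply_rule(state, rule):
    """Return the rewritten state, or None if the left-hand side is not present."""
    lhs, rhs = rule
    # Every symbol of the left-hand side must occur often enough in the state.
    if any(state[symbol] < count for symbol, count in lhs.items()):
        return None
    # Remove the matched symbols, then add the symbols of the right-hand side.
    return state - lhs + rhs


if __name__ == "__main__":
    start = Counter(["A"])
    after_synthesis = apply_rule(start, RULES["synthesis"])
    after_activation = apply_rule(after_synthesis, RULES["activation"])
    print(sorted(after_activation.elements()))
```

A multiset is used rather than a plain set so that several copies of the same molecule can be represented; a model that only tracks presence or absence could use sets instead.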

[0078] Further guidance is also available from the specific Examples.

[0079] Conditional rules are used to express a substitution of symbols that only occurs when a condition is met. The condition can be expressed as an “if” statement. The “if” statement is evaluated first, and, if true, the rewriting rule is performed.
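A conditional rule of this kind can be sketched as follows, with the state held as a set of symbols. The function name apply_conditional and the symbols "pH:neutral" and "pH:acidic" are hypothetical and serve only to show the order of evaluation: the condition is tested first, and the substitution is performed only if the condition holds.

```python
def apply_conditional(state, lhs, rhs, condition):
    """Rewrite lhs to rhs within a frozenset of symbols only if condition(state) holds."""
    if not condition(state):   # the "if" part is evaluated first
        return state
    if not lhs <= state:       # the left-hand side must be present in the state
        return state
    return (state - lhs) | rhs


# Hypothetical rule: A activates B, but only at neutral pH.
LHS = frozenset({"A", "B"})
RHS = frozenset({"A", "B:active"})


def neutral_ph(state):
    return "pH:neutral" in state


if __name__ == "__main__":
    print(sorted(apply_conditional(frozenset({"A", "B", "pH:neutral"}), LHS, RHS, neutral_ph)))
    print(sorted(apply_conditional(frozenset({"A", "B", "pH:acidic"}), LHS, RHS, neutral_ph)))
```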

[0080] The use of typing enhances the capacity and expressiveness of rewriting rule statements. For example, if proteins M and N are of type A, then the rewriting rule A→A B is evaluated as follows:

[0081] given “M”, the rewriting rule produces “M B”;

[0082] given “N”, the rewriting rule produces “N B.” Thus, symbols that are declared to be of type A (or a subtype thereof) are evaluated as A by the rule.
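The typed evaluation described above can be sketched as a lookup of declared types. The tables DECLARED_TYPE and SUPERTYPE, the subtype name "A1", and the symbols "P" and "Q" are hypothetical and are introduced only to show matching on a type or a subtype.

```python
# Declared types of individual symbols; M and N are of type A as in the text.
DECLARED_TYPE = {"M": "A", "N": "A", "P": "A1", "Q": "C"}
# Hypothetical subtype relation: A1 is a subtype of A.
SUPERTYPE = {"A1": "A"}


def has_type(symbol, wanted):
    """True if the symbol's declared type is `wanted` or a subtype of it."""
    current = DECLARED_TYPE.get(symbol)
    while current is not None:
        if current == wanted:
            return True
        current = SUPERTYPE.get(current)
    return False


def rewrite_typed(symbol):
    """Apply the rule A -> A B to any symbol that is of type A."""
    if has_type(symbol, "A"):
        return [symbol, "B"]
    return [symbol]


if __name__ == "__main__":
    for name in ("M", "N", "P", "Q"):
        print(name, "->", " ".join(rewrite_typed(name)))
```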

Computer Systems

[0083] The formal language environments described here are not limited to any particular hardware or software configuration as they can find applicability in any computing or processing environment. They can be implemented in hardware, software, or a combination of the two. They can be generated using a high level procedural language, object oriented programming language, or another formal language to communicate with a machine system. However, the programs can also be implemented in assembly or machine language, if desired. Each such program may be stored on a storage medium or device, e.g., compact disc read only memory (CD-ROM), hard disk, magnetic diskette, or similar medium or device, that is readable by a general or special purpose programmable machine for configuring and operating the machine when the storage medium or device is read by the computer to perform the procedures described in this document. The storage medium can also be made accessible across a computer network, e.g., the Internet.

[0084] User Interfaces. The formal language environment can feature one or more communications interfaces for accepting input from, and sending output to, a user. The user interface can be text-based and can include a text editor such as Emacs. The user interface can also be a graphical interface that facilitates user interaction with the language environment. For example, user interfaces can be deployed that allow the user to model and analyze biological systems without awareness of the underlying symbolic representations and operations. Icons for various biological elements can be displayed on a console. The user can use a mouse to connect interacting elements and to generate rules, e.g., by selecting from pop-up menus for possible outcomes of the interaction. The interface translates such user selections into a rewriting rule.

[0085] In another example, wiring diagrams are used to display information generated by the system. The term “wiring diagram” refers to a graphic having points, each representing an element of a system, and lines interconnecting the points on the basis of relationships between the elements. For example, the computer system can include a monitor that displays a wiring diagram with points on the wiring diagram corresponding to symbols representing biological elements. Lines between the points correspond to rules that interrelate the biological elements.

[0086] The user can create and select lines and connection points with a mouse and move connection points in order to formulate rules. The wiring diagram can also be used to display the output of a rewrite process. Points corresponding to symbols present in the output state can be displayed with one color, whereas symbols absent in the output state can be displayed with another color. When the rewrite process detects multiple alternative states, symbols present in all alternative states can be displayed with one color, symbols absent from all alternative states can be displayed with another color, and symbols whose presence or absence varies among the alternative states can be displayed with a third color. Thus, an image of the biological system after the rewrite process is conveyed graphically to the user. For example, if an entire arm of one pathway is activated, this would be readily apparent as a region of uniform color on the part of the monitor occupied by the particular branch of the pathway. Lines can be visually rendered according to whether rewriting rules were utilized in a given simulation. Further, the theorem prover and other tools can indicate which rewriting rules are true given certain input states.
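The three-way display scheme for multiple alternative states can be sketched as a simple classification of symbols. The function name classify_symbols and the category labels are assumptions for this sketch; a display layer would map each label to a color of its choosing.

```python
def classify_symbols(all_symbols, alternative_states):
    """Label each symbol by its presence across the alternative states."""
    if not alternative_states:
        raise ValueError("at least one alternative state is required")
    labels = {}
    for symbol in all_symbols:
        hits = sum(symbol in state for state in alternative_states)
        if hits == len(alternative_states):
            labels[symbol] = "present_in_all"    # first color
        elif hits == 0:
            labels[symbol] = "absent_from_all"   # second color
        else:
            labels[symbol] = "varies"            # third color
    return labels


if __name__ == "__main__":
    states = [frozenset({"A", "B"}), frozenset({"A"})]
    print(classify_symbols({"A", "B", "C"}, states))
```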

[0087] Other Interfaces. In another example, the computer system is outfitted with an automated crawler for searching and scanning information databases, such as abstracts in PubMed. The crawler can be programmed to identify certain key words, e.g., “phosphorylate,” “ubiquitinate,” or “secrete,” in order to identify information about the relationship of biological elements. An intelligent parser is then used to generate rules from the identified text.

[0088] In still another example, the computer is interfaced with annotated databases that catalog biological elements and pathways. Examples of such databases include EcoCyc and KEGG (see below). The interface can interpret tags and fields from the database and translate the information into rewriting rules.

[0089] Stored Models. The invention also features machine-readable memory for storing information for models of biological systems. The information can include: modules, e.g., defining symbols, typing and membership for the biological systems; rules, e.g., rewriting rules for the biological systems; and/or state information, e.g., collections of symbols defining one or more biological states. The memory store can optionally also contain a variety of other information such as hyperlinks, e.g., connecting a rule to a file with pertinent information such as raw experimental data or a scientific reference. The information can pertain to a single biological system or multiple systems.

[0090] Inference Methods. The invention also features the efficient manipulation of symbolic abstractions of biological systems. The algorithms employed include Decision Diagrams (e.g., as described in Bryant, R E (August 1986) “Graph-Based Algorithms for Boolean Function Manipulation”, IEEE Transactions on Computers, Vol. C-35: 677-691), which allow efficient representation of many logical functions. For many problems, Decision Diagrams provide an exponential advantage in representation and manipulation efficiency over naive conjunctive or disjunctive normal form. The algorithms employed also include transitive closure of hypergraphs. Transitive closure of hypergraphs representing biological problems can be accomplished in many cases with iterative doubling of the graph structure. In some cases this can provide exponential efficiency improvement. The algorithms employed further include lattice-based representation and manipulations of states.
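The iterative doubling idea can be sketched on an ordinary directed graph, which is a simplifying assumption; the hypergraph case described above is not shown. In each round, paths of up to twice the previously covered length are added, so a fixed point is reached after a number of rounds that grows with the logarithm of the longest shortest path. The function name transitive_closure is hypothetical.

```python
def transitive_closure(n, edges):
    """Reachability matrix of a directed graph on nodes 0..n-1 by repeated squaring."""
    reach = [[False] * n for _ in range(n)]
    for source, target in edges:
        reach[source][target] = True
    while True:
        # Compose the current relation with itself: i reaches j if it already did,
        # or if i reaches some k and k reaches j. This doubles the path length covered.
        doubled = [
            [
                reach[i][j] or any(reach[i][k] and reach[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
        if doubled == reach:   # fixed point: nothing new was added
            return reach
        reach = doubled


if __name__ == "__main__":
    # A four-step signaling chain 0 -> 1 -> 2 -> 3.
    closure = transitive_closure(4, [(0, 1), (1, 2), (2, 3)])
    print(closure[0][3], closure[3][0])
```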

[0091] Biological systems can also be represented using vectors of Boolean, discrete, or continuous values with significant efficiency gains in some cases.

System States

[0092] The state of a biological system is described using declared symbols. The symbols can describe the presence or absence of a component and provide descriptive information about a component, such as modification state, conformation, and localization.

[0093] State information can be assembled by a user, e.g., to supply an inference engine with information about a hypothetical starting state. State information is also generated as output by the inference engine, e.g., in the form of a terminal state or alternative resultant states. The input and output states can represent any possible state for a biological system. Some non-limiting examples of such states include a differentiated state, a quiescent state, a dividing or proliferative state, a diseased state, an injured state, an apoptotic state, a mutated or genetically predisposed state, an infected state, and an immune-compromised state.

[0094] The information about the state of a biological system can be assembled from a variety of sources including high-throughput bioinformatics tools.

Informatics

[0095] High-throughput tools can be used to analyze a whole genome and/or proteome. Such analysis can generate a large volume of data. Tools for such analysis are described below. Particularly useful data sets include gene expression profiles, polypeptide profiles, and protein interaction maps. “Profiles” are data records that include values or descriptors for multiple biological elements, e.g., genes or polypeptides. For example, a “gene expression profile” can include qualitative or quantitative information about the level of expression of a plurality of genes. A “polypeptide profile” can include qualitative or quantitative information about the abundance and/or modification state of a plurality of polypeptides. Profiles can provide an extensive description of the state of a cell. Qualitative or quantitative profile information can be translated into symbolic form and interpreted by an inference engine.

Gene Expression Profiles

[0096] Information about the expression level of multiple genes can be rapidly obtained using arrays of nucleic acid capture probes (see, e.g., Schena (1995) Science 270:467; Iyer et al. (1999) Science 283:83; DeRisi et al. (1997) Science 278:680; Lockhart and Winzeler (2000) Nature 405:827; Cho et al. (2001) Nature Genetics 27:48). Arrays contain multiple addresses, each dedicated to detecting the presence of a particular transcript. The abundance of thousands of transcripts in one or many samples can be detected.

[0097] Arrays can be fabricated by a variety of methods, e.g., photolithographic methods (see, e.g., U.S. Pat. Nos. 5,143,854; 5,510,270; and 5,527,681), mechanical methods (e.g., directed-flow methods as described in U.S. Pat. No. 5,384,261), pin based methods (e.g., as described in U.S. Pat. No. 5,288,514), and bead based techniques (e.g., as described in PCT US/93/04145). The capture probe can be a single-stranded nucleic acid, a double-stranded nucleic acid (e.g., which is denatured prior to or during hybridization), or a nucleic acid having a single-stranded region and a double-stranded region. Preferably, the capture probe is single-stranded. The capture probe can be designed, e.g., by a computer program, to satisfy a variety of criteria. The capture probe can be selected to hybridize to a sequence-rich (e.g., non-homopolymeric) region of the nucleic acid. The T_(m) of the capture probe can be optimized by prudent selection of the complementarity region and length, e.g., such that the T_(m)'s of all capture probes on the array are similar. A database scan of available sequence information for a species can be used to determine potential cross-hybridization and specificity problems.

[0098] The isolated nucleic acid is preferably mRNA that can be isolated by routine methods, e.g., including DNase treatment to remove genomic DNA and hybridization to an oligo-dT coupled solid substrate (e.g., as described in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y.). The substrate is washed, and the mRNA is eluted.

[0099] The isolated mRNA can be reverse transcribed and optionally amplified, e.g., by rtPCR, e.g., as described in U.S. Pat. No. 4,683,202. The nucleic acid can be an amplification product, e.g., from PCR (U.S. Pat. Nos. 4,683,196 and 4,683,202), rolling circle amplification (“RCA,” U.S. Pat. No. 5,714,320), isothermal RNA amplification or NASBA (U.S. Pat. Nos. 5,130,238; 5,409,818; and 5,554,517), and strand displacement amplification (U.S. Pat. No. 5,455,166). The nucleic acid can be labeled during amplification, e.g., by the incorporation of a labeled nucleotide. Examples of preferred labels include fluorescent labels, e.g., red-fluorescent dye Cy5 (Amersham) or green-fluorescent dye Cy3 (Amersham), and chemiluminescent labels, e.g., as described in U.S. Pat. No. 4,277,437. Alternatively, the nucleic acid can be labeled with biotin, and detected after hybridization with labeled streptavidin, e.g., streptavidin-phycoerythrin (Molecular Probes).

[0100] The labeled nucleic acid can be contacted to the array under hybridization conditions. The array can be washed, and then imaged to detect fluorescence at each address of the array. The extent of hybridization at an address is represented by a numerical value and stored, e.g., in a database record (e.g., a table row), a vector, a one-dimensional matrix, or a one-dimensional array. For example, the database record or vector has a numerical value for each address of the array. The numerical value can be adjusted, e.g., for local background levels, sample amount, and other variations. Further, a nucleic acid can also be prepared from a reference sample and hybridized to an array (e.g., the same or a different array). The sample expression profile and the reference profile can be compared, e.g., using a mathematical equation that is a function of the two vectors. The comparison can be evaluated as a scalar value, e.g., a score representing similarity of the two profiles. Either or both vectors can be transformed by a matrix in order to add weighting values to different nucleic acids detected by the array.

[0101] The expression data can be stored in an expression profile database, e.g., a relational database such as a SQL database (e.g., Oracle or Sybase database environments). The database can have multiple tables. For example, raw expression data can be stored in one table, in which each column corresponds to a nucleic acid being assayed, e.g., an address of an array, and each row corresponds to a sample. A separate table can store identifiers and sample information, e.g., the batch number of the array used, date, and other quality control information.

[0102] The similarity of a sample expression profile to a predictor expression profile (e.g., a reference expression profile that has associated weighting factors for each nucleic acid) can then be determined, e.g., by comparing the log of the expression level of the sample to the log of the predictor or reference expression value and adjusting the comparison by the weighting factor for all nucleic acids of predictive value in the profile.
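One plausible reading of this comparison, offered only as a sketch, is a weighted sum of absolute differences between log expression values, taken over the nucleic acids that carry a weighting factor. The function name weighted_log_distance, the gene names, and the choice of an absolute difference are assumptions; the text does not fix a particular formula. Smaller values indicate greater similarity.

```python
import math


def weighted_log_distance(sample, predictor, weights):
    """Weighted sum of |log(sample) - log(predictor)| over the predictive genes.

    All expression values are assumed to be strictly positive.
    """
    total = 0.0
    for gene, weight in weights.items():
        difference = math.log(sample[gene]) - math.log(predictor[gene])
        total += weight * abs(difference)
    return total


if __name__ == "__main__":
    # Hypothetical expression values and weighting factors.
    sample = {"gene1": 200.0, "gene2": 50.0}
    predictor = {"gene1": 100.0, "gene2": 50.0}
    weights = {"gene1": 2.0, "gene2": 1.0}
    print(weighted_log_distance(sample, predictor, weights))
```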

[0103] Stored or analyzed gene expression data is then used to specify the state of a cell. For example, each gene element in a model can be assigned a state, e.g., “on” or “off.” The determination of whether a gene is “on” or “off” can be made by a variety of statistical methods. For example, a numeric value for expression can be compared to the background detection values for genes known to be “off,” or for negative controls. In another example, a numeric value for expression is compared to a corresponding value in a reference data set. Expression information can also be categorized by predicates, e.g., to categorize expression of a nucleic acid as “activated,” “basal,” or “repressed.”
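As one illustration of the background comparison, a gene can be called “on” when its value exceeds the mean of the negative-control values by a chosen number of standard deviations. The function name call_gene_states, the default of three standard deviations, and the example values are assumptions for this sketch and are not prescribed by the text.

```python
import statistics


def call_gene_states(expression, background, n_sd=3.0):
    """Assign "on" or "off" to each gene by comparison with background values.

    `background` holds values for negative controls or genes known to be off;
    at least two values are needed to estimate a standard deviation.
    """
    threshold = statistics.mean(background) + n_sd * statistics.stdev(background)
    return {
        gene: ("on" if value > threshold else "off")
        for gene, value in expression.items()
    }


if __name__ == "__main__":
    background = [10.0, 12.0, 11.0, 9.0, 13.0]      # hypothetical control signals
    expression = {"geneX": 250.0, "geneY": 12.0}    # hypothetical measurements
    print(call_gene_states(expression, background))
```

The resulting “on” and “off” labels can then be used directly as symbols in an initial state.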

Regulatory Sequences

[0104] In addition to providing information about the expression level of genes, gene profiling information can be used to identify co-regulated genes. Data from different conditions is clustered to group genes that are similarly regulated in one or more samples. Such clusters are useful information for formulating rules about gene regulation. Techniques for clustering include hierarchical clustering (see, e.g., Sokal and Michener (1958) Univ. Kans. Sci. Bull. 38:1409), Bayesian clustering, k-means clustering, and self-organizing maps (see, Tamayo et al. (1999) Proc. Natl. Acad. Sci. USA 96:2907).

[0105] In one embodiment, multiple expression profiles from replicate data sets taken under different conditions are compared to identify nucleic acids whose expression level is predictive of each condition. Each candidate nucleic acid can be given a weighted “voting” factor dependent on the degree of correlation of the nucleic acid's expression and the sample identity as described in Golub et al. ((1999) Science 286:531). A correlation can be measured using a Euclidean distance or the Pearson correlation coefficient.
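The two measures named above can be sketched for a pair of expression vectors as follows. The function names pearson and euclidean are chosen for this sketch, and the example vectors are hypothetical; this is not the weighted voting procedure of Golub et al., only the underlying measures.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    gene_expression = [1.0, 2.0, 3.0, 4.0]   # one gene across four samples
    sample_class = [0.0, 0.0, 1.0, 1.0]      # hypothetical condition labels
    print(pearson(gene_expression, sample_class))
    print(euclidean(gene_expression, sample_class))
```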

[0106] Rules can be generated such that clustered genes are co-regulated in the model. For example, a rule that governs one gene of the cluster can be replicated or extended to regulate additional genes of the cluster.

[0107] Information about regulatory sequences that govern transcription of genes is also incorporated into rules. Such information can range from detailed characterization of a function of a transcription factor at a particular promoter to identification of a transcription factor binding consensus sequence in the vicinity of a gene. Information about regulatory sequences can be obtained by computer-based searches and/or biochemical experiments.

[0108] Sequence analysis programs are used to scan genomic nucleic acid sequences to identify sequences common to similarly regulated genes (see, e.g., Wolfsberg et al. (1999) Genome Research 9:755). As the binding sites can be quite small and degenerate, statistical procedures are used to enhance the search method. Identified binding sites can be correlated to known transcription factor binding sites by querying a database of known transcription factor binding sites. Even absent such information, a rule can be generated that links coregulated genes having the same binding site.

[0109] In addition, biochemical experiments are used to identify regulatory sequences. For example, Ren et al. ((2000) Science 290:2306) monitored the location of DNA binding proteins throughout the yeast genome. DNA binding proteins were crosslinked to nucleic acid with formaldehyde; the DNA was fragmented by sonication, and bound fragments were isolated by immunoprecipitation with antibodies specific to DNA-binding proteins of interest. The crosslinks were reversed; the fragments were amplified and labeled, and hybridized to an array containing all yeast intergenic nucleic acid sequences. This technique can be used as a means for generating rules or theorems. Genes that contain a particular binding site are likely to be regulated by the polypeptide that recognizes the binding site.

Protein-Protein Interaction Maps

[0110] Protein-protein interactions are common features of a biological system and can be central to generating a network of interactions, particularly in signaling pathways. Matrices of protein-protein interactions are available, such as those described in Walhout et al. (2000) Science 287:116-122; Uetz et al. (2000) Nature 403:623-631; and Schwikowski et al. (2000) Nature Biotech. 18:1257. Walhout et al. identified interactions among a matrix of C. elegans vulval development signaling proteins. Uetz et al. and Schwikowski et al. identified interactions among a matrix of thousands of proteins identified in yeast. The two-hybrid assay is used to determine if one polypeptide can bind to another. Since the assay is performed in yeast cells using fusion proteins to a DNA binding domain and a transcriptional activator, each yeast strain bearing a DNA binding domain fusion can be combined with a yeast strain bearing the activation domain fusion using a simple mating technique. The assay is also easily scored by assessing reporter gene transcription. Each observed protein interaction can be used to generate an interaction rule, or a testable theorem. For example, if protein A stably binds to protein B, a rewriting rule can be used to replace the instance of protein A and B with a protein complex A:B.
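The translation of observed interactions into complex-formation rules can be sketched as follows. The function names interaction_rules and bind, and the protein names used in the example, are hypothetical; the sketch assumes each observed pair is reported as two symbol names.

```python
def interaction_rules(pairs):
    """Turn observed binding pairs (a, b) into rules 'a b -> a:b'."""
    return [
        (frozenset({a, b}), frozenset({a + ":" + b}))
        for a, b in pairs
    ]


def bind(state, rule):
    """Replace both partners by their complex if both are present in the state."""
    lhs, rhs = rule
    if lhs <= state:
        return (state - lhs) | rhs
    return state


if __name__ == "__main__":
    # Hypothetical two-hybrid hits.
    rules = interaction_rules([("A", "B"), ("C", "D")])
    state = frozenset({"A", "B", "C"})
    for rule in rules:
        state = bind(state, rule)
    print(sorted(state))
```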

[0111] Interactions that are not observed can be due to failure of the assay to accurately detect a physical interaction, e.g., the assay may not identify all possible interactions for transmembrane proteins.

[0112] Interactions can be determined in a variety of conditions. For example, each protein-protein interaction can be observed in a different genetic environment, e.g., a different yeast host strain or with allelic variants of the interaction partners. Each protein-protein interaction can also be observed while the cells are exposed to an exogenous agent, e.g., a small organic compound such as a drug or drug candidate. Information about such interactions can be used to build additional rules, for example, conditional rewriting rules.

Polypeptide Arrays

[0113] Polypeptide arrays can be used to obtain data about a cell state. Such arrays can be used to rapidly assay many different polypeptides. For example, an array can be contacted with a substrate, a ligand, or an enzyme to identify interactions.

[0114] A low-density (96 well format) protein array has been developed in which proteins are spotted onto a nitrocellulose membrane (Ge, H. (2000) Nucleic Acids Res. 28, e3, I-VII). A high-density protein array (100,000 samples within 222×222 mm) used for antibody screening was formed by spotting proteins onto polyvinylidene difluoride (PVDF) (Lueking et al. (1999) Anal. Biochem. 270, 103-111). Polypeptides can be printed on a flat glass plate that contains wells formed by an enclosing hydrophobic Teflon mask (Mendoza et al. (1999) Biotechniques 27, 778-788). Also, polypeptides can be covalently linked to chemically derivatized flat glass slides in a high-density array (1600 spots per square centimeter) (MacBeath, G., and Schreiber, S. L. (2000) Science 289, 1760-1763). De Wildt et al. describe a high-density array of 18,342 bacterial clones, each expressing a different single-chain antibody, in order to screen antibody-antigen interactions (De Wildt et al. (2000) Nature Biotech. 18, 989-994). These known methods and others can be used to generate an array of antibodies for detecting the abundance of polypeptides in a sample. The sample can be labeled, e.g., biotinylated, for subsequent detection with streptavidin coupled to a fluorescent label. The array can then be scanned to measure binding at each address.

Proteomics

[0115] Proteomics includes a large set of tools for the analysis of polypeptides in a sample. These tools can be used to monitor expression level, post-translational modifications, enzymatic activity, protein-protein interactions, and evolutionary relationships.

[0116] Additional sources for rules include computational methods for identifying interactions between elements of biological systems. Examples of such methods include the comparative genome and phylogenetic analysis described in Pellegrini et al. (1999) Proc. Natl. Acad. Sci. USA 96:4285 and Marcotte et al. (1999) Nature 402:83. Interactions predicted by computational methods can be verified using a theorem prover.

Mass Spectroscopy

[0117] Mass spectroscopy can be used to obtain accurate analysis of the identity and/or modification state of a polypeptide species. For example, a polypeptide can be fragmented with a site-specific protease (e.g., trypsin, chymotrypsin, or subtilisin), combined with a matrix, and then excited with a laser. The flight of the resulting ions through the mass spectroscopy instrument is analyzed to determine their molecular weights with high accuracy. The molecular weights of proteolytic fragments provide an accurate fingerprint of the polypeptide or of a pool of polypeptides. Changes in the molecular weight of peptide fragments can be associated with post-translational modification. See, e.g., Shevchenko et al. (2000) Anal. Chem. 71:2132-2141 for a review.

[0118] Mass spectroscopy can be combined with 2-dimensional gel electrophoresis to provide a polypeptide profile of a cell or sample. Polypeptides from the cell or sample are separated in an acrylamide gel by isoelectric point and molecular weight. Then different addresses of the gel are analyzed as described above. Information about isoelectric point, molecular weight, and proteolytic fragment sizes can be stored in database records. This information can be used to generate information about the state of a biological system. Alternatively, this information can be compared to a similar analysis of a reference sample or cell. The result of the comparison can be used to generate a profile of the polypeptides in a cell. The profile can include information such as “protein X is phosphorylated” and “protein Y is proteolytically processed.”

Metabolic Pathways

[0119] Rules can be generated from curated information about metabolic pathways, e.g., from textbooks, experimental observations, or databases. For example, the EcoCyc database (Karp et al. (2000) Nucl Acids Res 28:56-59) is a computer database of metabolic pathways in E. coli. The database includes computer-readable and portable representations (e.g., ontologies) of biological functions for nucleic acid and amino acid sequences. In one version, the EcoCyc database included 744 reactions catalyzed by 607 enzymes, many of which are multifunctional. The reactions were organized into 131 pathways. This metabolic map described 791 chemical substrates. Also included are an additional 161 reactions involving macromolecule metabolism, e.g., DNA replication and tRNA charging.

[0120] Information from the database can be translated into symbols and imported into a rule generator.

[0121] The database includes entries for components and steps of pathways, e.g., a metabolic pathway that includes a step for converting metabolite X into metabolite Y using enzyme Z. Such a conversion can be expressed by the following rewriting rule:

X Z → Y Z.

[0122] Other databases of metabolic and signaling pathways include the WIT server (http://www-unix.mcs.anl.gov/compbio/), and KEGG (Ogata et al. (1998) BioSystems 47:119; http://www.genome.ad.jp/kegg/).

Uses

[0123] The models, theorem provers, and inference engines described here are versatile tools for the analysis of biological systems. Non-limiting examples of their use include the generation of testable hypotheses, identification of drug targets, engineering biological circuits, diagnostics, and education.

[0124] The models can be of biological systems of any species of life, particularly, of human, mouse, rat, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, Arabidopsis thaliana, Oryza sativa, Zea mays, Saccharomyces cerevisiae, Escherichia coli, Salmonella typhimurium, and Mycoplasma genitalium.

[0125] Mycoplasma genitalium is notable as being one of the simplest cellular life forms. The 580 kilobase genome of Mycoplasma genitalium was determined (Fraser et al. (1995) Science 270:397). The genome encodes only 480 polypeptide species and 37 RNA species. By transposon mutagenesis, Hutchison et al. ((1999) Science 286:2165) have determined that only 265 to 350 of these genes are likely to be essential for supporting life. This genetic information and information from on-going efforts to characterize the genes of this minimalist system can be used as a prototypic model for a biological model of a cell. Rules and inferences obtained from this model can also be applied to models of more complex systems. Also, an animated model based on a rule set for Mycoplasma genitalium can be used as an interactive educational tool.

Hypothesis Generation

[0126] In one exemplary application of the methods described here, rules and a hypothetical input state of a biological system are processed by an inference engine. The engine identifies one or more resultant states. Each of these resultant states can be regarded as a testable hypothesis. An experimentalist can test such hypotheses using a laboratory model of the biological system. The results of the laboratory experiments can be used to develop additional rules about the system. For example, the results can be fed into an inference engine, e.g., along with previous information, to obtain possible resultant states. If the experimental results are able to differentiate among the previous set of resultant states, this second run of the inference engine yields fewer alternative resultant states. This cycle of hypothesis generation and testing can be repeated until only one or a select few alternative resultant states are identified.
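
The cycle of hypothesis generation and pruning can be sketched in Python as follows. This is an illustrative sketch only, not the inference engine of the Example. It assumes states are multisets of symbol names, that the set of reachable states is finite, and that an experimental observation can be expressed as the presence of a symbol. The rules, the symbols L, R1 and R2, and the function names are hypothetical.

```python
from collections import Counter, deque

def successors(state, rules):
    """Yield each state reachable from `state` by one rule application."""
    bag = Counter(state)
    for lhs, rhs in rules:
        lhs_bag = Counter(lhs)
        if all(bag[s] >= n for s, n in lhs_bag.items()):
            yield tuple(sorted((bag - lhs_bag + Counter(rhs)).elements()))

def resultant_states(initial, rules):
    """Breadth-first search for states in which no rule applies."""
    start = tuple(sorted(initial))
    seen, queue, terminal = {start}, deque([start]), set()
    while queue:
        state = queue.popleft()
        nexts = list(successors(state, rules))
        if not nexts:
            terminal.add(state)          # no rule applies: a resultant state
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return terminal

# Hypothetical rules: ligand L can bind either receptor R1 or receptor R2.
RULES = [(("L", "R1"), ("L:R1",)), (("L", "R2"), ("L:R2",))]
hypotheses = resultant_states(["L", "R1", "R2"], RULES)

# Hypothetical observation: receptor R2 is still free after stimulation.
consistent = {s for s in hypotheses if "R2" in s}
```

In this sketch the engine proposes two alternative resultant states, and the single observation eliminates one of them, leaving one hypothesis for further testing.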

[0127] In addition, results from experimental tests can be used by theorem prover functions of the inference engine in order to identify and/or test new rules. Thus, each additional experimental result produces additional rules that describe the system.

[0128] Further, the inference engines described here can be used to identify the properties of rule sets. For example, the rules can be analyzed to determine whether they are terminating (every sequence of substitutions eventually stops) and/or Church-Rosser (the result does not depend on the order in which the rules are applied). Such analysis can be useful for identifying intricate feedback and feedforward loops, e.g., loops that may not be readily apparent from a wiring diagram.
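
For a finite state space, properties of this kind can be probed by exhaustive search. The Python sketch below is an assumption-laden illustration, not a general termination or confluence prover: it examines only the states reachable from one given initial state, reports whether any rewrite sequence revisits a state (a loop), and collects the states in which no rule applies. A single normal form is consistent with the Church-Rosser property but does not prove it for other initial states. All rule sets and names are hypothetical.

```python
from collections import Counter

def successors(state, rules):
    """Yield each state reachable from `state` by one rule application."""
    bag = Counter(state)
    for lhs, rhs in rules:
        lhs_bag = Counter(lhs)
        if all(bag[s] >= n for s, n in lhs_bag.items()):
            yield tuple(sorted((bag - lhs_bag + Counter(rhs)).elements()))

def analyze(initial, rules):
    """Return (terminates, normal_forms) for the states reachable from `initial`."""
    on_path, done, normal_forms = set(), set(), set()
    terminates = True

    def visit(state):
        nonlocal terminates
        on_path.add(state)
        nexts = set(successors(state, rules))
        if not nexts:
            normal_forms.add(state)
        for nxt in nexts:
            if nxt in on_path:
                terminates = False       # a rewrite sequence revisits a state
            elif nxt not in done:
                visit(nxt)
        on_path.discard(state)
        done.add(state)

    visit(tuple(sorted(initial)))
    return terminates, normal_forms

# Hypothetical feedback loop: A is activated to A* and deactivated again.
loop = analyze(["A"], [(("A",), ("A*",)), (("A*",), ("A",))])
# Hypothetical competition: ligand L binds either R1 or R2.
race = analyze(["L", "R1", "R2"], [(("L", "R1"), ("L:R1",)), (("L", "R2"), ("L:R2",))])
# The metabolic rule X Z -> Y Z applied to two copies of X.
simple = analyze(["X", "X", "Z"], [(("X", "Z"), ("Y", "Z"))])
```

The first rule set does not terminate, the second terminates with two distinct normal forms (so the outcome depends on rule order), and the third terminates with a single normal form.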

Diagnostics

[0129] The methods described here can also be used in diagnostic applications. Information about a sample from a patient is used to supply an initial state for the inference engine. The information can include information about gene expression, polypeptide modification state, and genetic variations. The inference engine then computes possible alternative resultant states using rules, e.g., rules pertaining to the type of tissue sample used. At least some of the rules can include relationships between genetic variations in the elements (e.g., mutations, SNPs, translocations, and trinucleotide repeats) and their function in the system. The alternative resultant states represent possible diagnoses for the patient. These alternatives can suggest additional information to supply to the system, e.g., other symptoms, in order to refine the diagnosis. In addition, complex rule sets representing rules for multiple different cell types can be used in order to model possible disorders involving multiple cell types.

[0130] For example, the models can be used to predict disorders characterized by changes in cell proliferation (e.g., cancers), cell differentiation, cell adhesion (e.g., metastatic cancers), hormone levels (e.g., metabolic and neurological disorders), and so forth.
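
One way to relate resultant states to diagnoses is to compare each resultant state with reference states. The Python sketch below assumes that a reference state is a set of marker symbols and that a resultant state matches when it contains all of them. The reference labels, the marker symbols, and the function name are hypothetical and chosen only for illustration.

```python
def possible_diagnoses(resultant_states, reference_states):
    """Map each diagnosis to the resultant states containing all of its reference symbols."""
    found = {}
    for diagnosis, reference in reference_states.items():
        hits = [s for s in resultant_states if set(reference) <= set(s)]
        if hits:
            found[diagnosis] = hits
    return found

# Hypothetical reference states, each associated with a label.
REFERENCES = {
    "proliferative": ("E2F1:DP1", "pRb-P(D,E)"),
    "quiescent": ("pRb:E2F1:DP1",),
}

# Hypothetical resultant states, as might be computed by an inference engine.
RESULTANT = [
    ("cycD:cdk4", "cycE:cdk2", "pRb:E2F1:DP1"),
    ("E2F1:DP1", "cycD:cdk4", "cycE:cdk2", "pRb-P(D,E)"),
    ("cycD", "cdk4"),
]

diagnoses = possible_diagnoses(RESULTANT, REFERENCES)
```

Resultant states that match no reference, such as the third one here, indicate where additional information about the sample would be needed.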

Drug Discovery

[0131] The methods described here can be used to identify drug targets. For example, the states of a normal and a diseased cell are compared by the inference engine. The diseased cell, such as a cancer cell, is used as an initial state for the cell system. The initial state can be generated from observations such as a gene expression profile obtained from a sample of diseased cells. The final state can be generated from similar observations of a normal cell. The methods described here are used to identify one or more elements of the diseased system that, when altered, cause the diseased state to transition to the normal state.

[0132] Such elements are potential drug targets, as alteration of their state causes the system to return to a desired state. Accordingly, a drug that similarly alters the properties of the identified element should cause the diseased cell to become normal.
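
A search for such elements can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from the text: an alteration is modeled as removal of one element from the diseased initial state, the desired state is recognized by the absence of a disease marker symbol from every resultant state, and the reachable state space is finite. The pathway, including the constitutively active mutant kinase Kmut, is hypothetical.

```python
from collections import Counter, deque

def successors(state, rules):
    """Yield each state reachable from `state` by one rule application."""
    bag = Counter(state)
    for lhs, rhs in rules:
        lhs_bag = Counter(lhs)
        if all(bag[s] >= n for s, n in lhs_bag.items()):
            yield tuple(sorted((bag - lhs_bag + Counter(rhs)).elements()))

def resultant_states(initial, rules):
    """Return all reachable states in which no rule applies."""
    start = tuple(sorted(initial))
    seen, queue, terminal = {start}, deque([start]), set()
    while queue:
        state = queue.popleft()
        nexts = list(successors(state, rules))
        if not nexts:
            terminal.add(state)
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return terminal

def candidate_targets(diseased, rules, marker):
    """List elements whose removal keeps `marker` out of every resultant state."""
    targets = []
    for element in sorted(set(diseased)):
        altered = list(diseased)
        altered.remove(element)          # model the alteration as removal of one copy
        if all(marker not in s for s in resultant_states(altered, rules)):
            targets.append(element)
    return targets

# Hypothetical pathway: growth factor GF, receptor R, kinase K, mutant kinase Kmut,
# and transcription factor TF; TF* is the assumed disease marker.
RULES = [
    (("GF", "R"), ("GF:R",)),
    (("GF:R", "K"), ("GF:R", "K*")),
    (("K*", "TF"), ("K*", "TF*")),
    (("Kmut", "TF"), ("Kmut", "TF*")),
]
targets = candidate_targets(["R", "K", "Kmut", "TF"], RULES, "TF*")
```

In this hypothetical system, removing the receptor or the wild-type kinase does not prevent the disease marker, whereas removing the mutant kinase or the transcription factor does.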

Engineering Biological Circuits

[0133] The rules can be used to design artificial regulatory circuits. Genetic engineering has been used to construct transcriptional regulatory circuits and synthetic proteins with new properties. Such circuits are useful for creating cell-based biosensors that can sense environmental changes and intelligent cell-based therapeutic delivery systems, e.g., recombinant cells producing a therapeutic polypeptide such as a humanized antibody or a polypeptide hormone.

[0134] Gardner et al. ((2000) Nature 403:339) and Becskei & Serrano ((2000) Nature 405:590) describe the design of artificial biological circuits that function as stable genetic switches.

[0135] The design of biological circuits is facilitated by the inference engines and methods described here, since these tools can be used to identify rules that are necessary for a contemplated circuit. The user can then construct the necessary recombinant molecules to implement the rule. Such recombinant molecules can include chimeric signaling molecules (e.g., ones that fuse an adaptor from one pathway with an enzyme of another pathway), chimeric transcription regulatory sequences (e.g., combinations of binding sites for different transcriptional regulators), chimeric transcription factors, and artificial promoters.

[0136] Alternatively, the methods described here are used to test whether a contemplated circuit would function as designed. Rules are generated based on the circuit design and are supplied to an inference engine in combination with one or more hypothetical initial states. The outcome of the circuit is rigorously simulated under a variety of conditions to determine whether it operates as expected before implementation.

Pathogens

[0137] The interaction of a pathogen with its host provides an additional example of the diagnostic and analytic application of the methods described here. Rules that describe the circuitry of pathogen molecules are combined with rules that describe the circuitry of host cells to model pathogenic events. Examples of pathogens suitable for this analysis include viruses (e.g., retroviruses such as HIV, DNA-tumor viruses, adenoviruses, herpes viruses, and bacteriophages), bacteria such as Gram-negative and Gram-positive bacteria, protists such as Plasmodium falciparum, fungi, and metazoans (e.g., filarial nematodes).

[0138] The model of host and pathogen can be used to identify key host and pathogenic elements that are potential drug targets. Further, the system can be analyzed to identify host cell states that require the presence of a pathogen. Such host cell states are used as references in diagnostic tests for determining whether a pathogen is present or active.

[0139] In viruses, the host and pathogen are particularly intimately associated, as the virus enters cells and utilizes host cell factors. Rules can, for example, test for the presence of appropriate cell surface receptors and co-receptors for viral entry. In addition, the cell has its own response to invasion. Double-stranded RNA, for example, is detected and activates dsRNA-dependent kinase and subsequent signaling events. Such signals can result in interferon-γ production. Rules describing additional cells such as immune cells can also be included.

Some Illustrative Signaling Elements

[0140] Many cellular signaling networks feature the sensing of an extracellular signal, the activation of a cytoplasmic sequence of signaling events (such as a cascade of kinase activation), and the regulation of gene transcription.

[0141] Kinase cascades. The presence of an extracellular signal frequently elicits the activation of an intracellular kinase cascade. For example, the binding of BDNF to its receptor, a tyrosine kinase receptor, activates binding of the adaptor molecule GRB2 to the receptor. GRB2 recruits the guanine nucleotide exchange factor (GEF) Sos to the complex. Sos causes the guanine nucleotide binding protein Ras to release GDP and bind GTP. The GTP-bound state of Ras activates the serine-threonine kinase Raf. Raf activates the MAP kinase kinases (MAPKKs) MEK1 and MEK2, which activate MAP kinases 1 and 2. MAP kinases 1 and 2 phosphorylate the transcription factor Elk-1, thus causing changes in gene regulation.
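
The cascade described above can be written as a chain of rewriting rules. The Python sketch below is illustrative only. It assumes simplified symbol names (for example, MEK stands for MEK1 and MEK2 together, MAPK for MAP kinases 1 and 2, and a trailing asterisk marks an activated kinase), none of which come from the Maude code of the Example, and it applies rules repeatedly until none matches.

```python
from collections import Counter

# Each rule is (left-hand symbols, right-hand symbols); names are illustrative only.
CASCADE = [
    (("BDNF", "Receptor"), ("BDNF:Receptor",)),
    (("BDNF:Receptor", "GRB2"), ("BDNF:Receptor:GRB2",)),
    (("BDNF:Receptor:GRB2", "Sos"), ("BDNF:Receptor:GRB2:Sos",)),
    (("BDNF:Receptor:GRB2:Sos", "Ras-GDP"), ("BDNF:Receptor:GRB2:Sos", "Ras-GTP")),
    (("Ras-GTP", "Raf"), ("Ras-GTP", "Raf*")),
    (("Raf*", "MEK"), ("Raf*", "MEK*")),
    (("MEK*", "MAPK"), ("MEK*", "MAPK*")),
    (("MAPK*", "Elk-1"), ("MAPK*", "Elk-1-P")),
]

def rewrite_to_fixpoint(state, rules):
    """Apply rules until no rule matches; assumes the rule set terminates."""
    bag = Counter(state)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            lhs_bag = Counter(lhs)
            if all(bag[s] >= n for s, n in lhs_bag.items()):
                bag = bag - lhs_bag + Counter(rhs)
                changed = True
    return bag

cell = ["Receptor", "GRB2", "Sos", "Ras-GDP", "Raf", "MEK", "MAPK", "Elk-1"]
with_signal = rewrite_to_fixpoint(cell + ["BDNF"], CASCADE)
without_signal = rewrite_to_fixpoint(cell, CASCADE)
```

With the extracellular signal present, the final state contains phosphorylated Elk-1; without it, no rule applies and the state is unchanged.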

[0142] Transcription Factors. Many well characterized transcription factors and their nucleic acid binding sites function as signal integrators.

[0143] For example, the interferon-β gene is activated by three polypeptides, ATF-2, NF-κB, and IRF-1, in response to viral infection. Double-stranded RNA from the infecting virus separately triggers each of the transcription factors. However, to ensure the fidelity of the response, all three factors must be active for transcription to ensue. The biochemical basis for this switch is a synergistic binding of the polypeptides to a segment of the promoter (reviewed in Maniatis et al. (1992) In Transcriptional Regulation Vol. 2, pp. 1193-1220 Cold Spring Harbor Press, Cold Spring Harbor N.Y.). Although each polypeptide alone can bind weakly to the segment, protein-protein interactions among the three result in highly cooperative binding (Du et al. (1993) Cell 74:887; Thanos and Maniatis (1995) Cell 83:1091). This example is illustrative of the role of information about physical interactions, such as protein-protein interactions and protein-nucleic acid interactions, in modeling a system. This system can be modeled as a rewriting rule that replaces activated ATF-2, NF-κB, and IRF-1 with interferon-β.
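
The requirement for all three factors can be checked with a short Python sketch. It assumes a multiset representation of state and a naming convention in which a trailing asterisk marks an activated factor. The symbol names and the function rewrite_once are hypothetical.

```python
from collections import Counter
from itertools import combinations

FACTORS = ("ATF-2*", "NF-kB*", "IRF-1*")   # '*' marks an activated factor (assumed notation)
RULE = (FACTORS, ("IFN-beta",))             # replaces the three activated factors with interferon-beta

def rewrite_once(state, rule):
    """Apply the rule once if its whole left-hand side is present; otherwise return the state."""
    lhs, rhs = Counter(rule[0]), Counter(rule[1])
    bag = Counter(state)
    if all(bag[s] >= n for s, n in lhs.items()):
        return sorted((bag - lhs + rhs).elements())
    return sorted(bag.elements())

# Try every subset of activated factors as an initial state.
outcomes = {}
for k in range(len(FACTORS) + 1):
    for present in combinations(FACTORS, k):
        outcomes[present] = "IFN-beta" in rewrite_once(present, RULE)
```

Of the eight possible combinations of activated factors, only the one containing all three yields interferon-β, which mirrors the signal-integrating behavior described above.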

[0144] Another gene system that processes two independent signals at its promoter is the lac operon of E. coli. The lac operon, which includes the lacZYA genes, is transcribed only in the presence of lactose and the absence of glucose. Neither condition alone suffices. In the absence of lactose, the lac repressor inhibits lac transcription by preventing clearance of RNA polymerase (Gralla (1992) In Transcriptional Regulation Vol. 2, pp. 629-642 Cold Spring Harbor Press, Cold Spring Harbor N.Y.). In the presence of lactose, an inducer, allolactose, which is a secondary metabolite of lactose, binds to the lac repressor and decreases its affinity for DNA. However, loss of lac repressor binding is not sufficient for lac operon transcription. The catabolite activator protein (CAP) must also be activated by cAMP, a small molecule indicator of the absence of glucose.
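
Because the lac operon responds to the absence of one signal as well as the presence of another, it is naturally expressed as a conditional rule, i.e., a rule with a guard. The Python sketch below is a simplified illustration under the assumption that the state is a list of symbol names. The symbol lacZYA-mRNA and the function names are hypothetical, and intermediate species such as allolactose and cAMP are folded into the guard.

```python
def lac_transcribed(state):
    """Guard of the conditional rule: lactose present and glucose absent."""
    symbols = set(state)
    repressor_released = "lactose" in symbols   # allolactose lowers repressor affinity for DNA
    cap_active = "glucose" not in symbols       # cAMP-activated CAP signals absence of glucose
    return repressor_released and cap_active

def step(state):
    """Add a transcript symbol when the guard holds; otherwise leave the state unchanged."""
    state = list(state)
    if lac_transcribed(state) and "lacZYA-mRNA" not in state:
        state.append("lacZYA-mRNA")
    return sorted(state)

results = {
    "lactose only": step(["lactose"]),
    "glucose only": step(["glucose"]),
    "both": step(["glucose", "lactose"]),
    "neither": step([]),
}
```

Only the state with lactose and without glucose acquires the transcript symbol.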

[0145] The spatial and temporal regulation of stripes of gene expression in the Drosophila embryo is also a manifestation of the cell's ability to process complex rules. The even-skipped segmentation gene is controlled by multiple enhancer regions, each directing expression of one stripe. The eve stripe 2 enhancer (Arnosti et al. (1996) Development 122:205) is a 500 basepair DNA site that binds multiple factors, including Bicoid and Hunchback, which together activate the promoter, and Giant and Kruppel, which repress the promoter. The cells of stripe 2 are located in a zone of high activity of Bicoid, a morphogen present at high concentrations at the anterior of the embryo. The anterior border of stripe 2, in contrast, is the result of repression by anterior stores of the Giant repressor. The posterior edge is limited by Kruppel expression. Thus, the convergence of multiple transcription factors on a regulatory element of a promoter generates a spatially restricted pattern of gene expression. The combinatorial control by these various factors can be implemented as a rewriting rule of the biological system.

[0146] Neuromuscular Junction. Interactions between cells can be modeled. For example, the interactions between a neuron and a muscle cell at neuromuscular junctions are precisely regulated. The interactions include the arrival of an action potential in the axon of the neuron at the synapse. This action potential can open voltage-gated Ca²⁺ channels. The Ca²⁺ signal can trigger synaptic vesicle membrane proteins to interact with plasma membrane proteins in order to cause the release of neurotransmitters, such as acetylcholine, from the synaptic vesicle into the synapse. The muscle cell has surface receptors for the neurotransmitter. The acetylcholine receptor is a ligand-gated ion channel that has an open and a closed conformation. The binding of acetylcholine to the receptor triggers a conformational change from closed to open, thus allowing small cations to enter the muscle cell and depolarize the membrane. This generates a propagating action potential in the muscle cell that results in muscular contraction.

[0147] Additional mechanisms operate to desensitize and/or reset the system. Acetylcholinesterases in the cleft inactivate and destroy the secreted acetylcholine. In another adaptive response, Ca²⁺ opens channels that let potassium ions enter the axon and terminate Ca²⁺ influx, thus ending the action potential.

EXAMPLE

[0148] The Maude language was used to model the control of the tumor suppressor protein pRB and its interactions with associated cell-cycle regulators, cycD, cycE, cdk4, cdk2, E2F1, and DP1. Such interactions are depicted in the wiring diagrams of FIGS. 3A and 3B. Kohn ((1999) Mol. Biol. of the Cell 10:2703-2734) provided similar wiring diagrams of a cell cycle regulatory network that includes these components. The Kohn map is a synthesis of numerous published experimental results. Indicated on the map are multiprotein complexes, gene promoters, DNA damage events, and enzymes, e.g., DNA repair enzymes and DNA modification enzymes.

[0149] The Appendix of U.S. Provisional Application No. ______, filed May 10, 2001, titled “Modeling Biological Systems,” naming Patrick D. Lincoln and Keith R. Laderoute as inventors, provides an example of Maude code for modeling cellular behavior.

[0150] Components of the cell-cycle regulation system were first defined using hierarchical types. What follows is an example of Maude code for declaring symbols for this biological system:

TABLE 2
Maude Code Declaring General System Components

fmod MODIFICATION is
  pr MACHINE-INT .
  sorts Site Modification ModSet AminoAcid Protein .
  subsort Modification < ModSet .
  ops glycine alanine valine leucine isoleucine proline : -> AminoAcid .
  ops serine threonine cysteine methionine asparagine : -> AminoAcid .
  ops glutamine phenylalanine tyrosine tryptophan lysine : -> AminoAcid .
  ops arginine histidine aspartate glutamate selenocysteine : -> AminoAcid .
  op AminoAcidSite : AminoAcid MachineInt -> Site .
  op phos : Site -> Modification .
  op acetyl : Site -> Modification .
  op ubiq : Site -> Modification .
  op none : -> ModSet .
  *** juxtaposition builds a multiset of modifications
  op __ : ModSet ModSet -> ModSet [assoc comm id: none] .
  op _contains_ : ModSet Modification -> Bool .
  vars M M' : Modification .
  var MS : ModSet .
  eq none contains M' = false .
  eq (M MS) contains M' = if M == M' then true else MS contains M' fi .
endfm

fmod PROTEIN is
  pr MODIFICATION .
  sort Protein .
  *** a protein annotated with a set of modifications
  op [_|_] : Protein ModSet -> Protein [right id: none] .
endfm

fmod COMPLEX is
  pr PROTEIN .
  sort Complex .
  subsorts Protein < Complex .
  op _:_ : Complex Complex -> Complex [comm] .
endfm

fmod SOUP is
  pr COMPLEX .
  sort Soup .
  subsort Complex < Soup .
  *** juxtaposition builds a multiset of complexes
  op __ : Soup Soup -> Soup [assoc comm] .
endfm

[0151] The first module (MODIFICATION) declares the sorts (types) “Site,” “Modification,” “ModSet,” “AminoAcid,” and “Protein.” “Modification” is a subsort of “ModSet,” so a single modification is itself a modification set. These sorts are generally useful for describing proteins and features of their sequence, particularly amino acid modifications.

[0152] “AminoAcid” can be any of the twenty standard amino acids or selenocysteine.

[0153] “AminoAcidSite” is an operator that constructs a “Site” from an amino acid and an integer (MachineInt) associated with it.

[0154] Next, three different post-translational modifications are declared: “phos” for phosphorylation, “acetyl” for acetylation, and “ubiq” for ubiquitination.

[0155] The next module (PROTEIN) declares an operator that attaches a set of modifications to a protein symbol.

[0156] The module (COMPLEX) declares the sort “Complex.” For convenience, a protein, e.g., a monomer protein, is also declared as a “Complex,” i.e., a complex of one species. The operator having a colon (“_:_”) is defined such that A:B refers to a complex of A and B. This operation is commutative.

[0157] The module (SOUP) was similarly constructed and refers to an environment having multiple complexes.

TABLE 3
Maude Code Declaring Cell Cycle Control Elements

mod K5 is
  inc SOUP .
  *** constants to represent unmodified proteins
  ops cycD cycE cdk4 cdk2 pRb E2F1 DP1 : -> Protein .
  *** [1] Cyclin D binds to Cdk4 forming the complex CycD:Cdk4
  rl cycD cdk4 => cycD : cdk4 .
  *** [2] Cyclin E binds to Cdk2 forming the complex CycE:Cdk2
  rl cycE cdk2 => cycE : cdk2 .
  *** [3] CycD:Cdk4 phosphorylates pRb at site D
  op D : -> Site .
  rl pRb (cycD : cdk4) => [pRb | phos(D)] (cycD : cdk4) .
  *** [4] CycE:Cdk2 phosphorylates pRb-P(D) at site E.
  op E : -> Site .
  rl [pRb | phos(D)] (cycE : cdk2) => [pRb | phos(D) phos(E)] (cycE : cdk2) .
  *** [5] E2F1 binds to DP1
  rl E2F1 DP1 => E2F1 : DP1 .
  *** [6] Fully phosphorylated pRb cannot bind to E2F1:DP1
  var M : ModSet .
  crl [pRb | M] (E2F1 : DP1) => [pRb | M] : (E2F1 : DP1)
    if not((M contains phos(D)) and (M contains phos(E))) .
endm

rew cycD cycE cdk4 cdk2 pRb E2F1 DP1 .

***
*** Regulation of G1-S transition
***
mod G1S is
  inc SOUP .
  *** [7] constants to represent unmodified proteins
  ops cycD cycE cdk4 cdk2 pRb E2F1 DP1 : -> Protein .
  sorts Signal MitogenicSignal SPhaseEntrySignal .
  subsorts MitogenicSignal SPhaseEntrySignal < Signal .
  *** [8] Signals are Declared
  subsort Signal < Complex .
  op TGFbeta : -> Signal .
  op replicate' : -> MitogenicSignal .
  op entersphase! : -> SPhaseEntrySignal .
  var MITOGENICSIGNAL : MitogenicSignal .
  sorts CipKip CKI .
  subsorts CipKip CKI < Protein .
  ops p21Cip1 p27Kip1 p57Kip2 : -> CipKip .
  var CIPKIP : CipKip .
  ops p15 p16 ink4a ink4b : -> CKI .
  var CKIv : CKI .
  op D : -> Site .
  *** cytoplasm is used by rules [13] and [14] but is not declared in the
  *** listing; a constant of sort Complex is assumed here
  op cytoplasm : -> Complex .
  *** [9] Cyclin D binds to Cdk4 forming the complex CycD:Cdk4
  rl cycD cdk4 => cycD : cdk4 .
  *** [10] MitogenicSignal prods cyclin complex to sequester cipkip
  rl MITOGENICSIGNAL (cycD : cdk4) CIPKIP => (cycD : cdk4) : CIPKIP .
  *** [11] TGFbeta produces Ink4a and Ink4b
  rl TGFbeta => ink4a ink4b .
  *** [12] ink4 ruins the big complex, marks cycD for destruction
  rl CKIv ((cycD : cdk4) : CIPKIP) => (CKIv : cdk4) CIPKIP [cycD | phos(D)] .
  *** [13] phosphorylated cycD is destroyed
  rl [cycD | phos(D)] => cytoplasm .
  *** [14] almost an identity
  rl cytoplasm cytoplasm => cytoplasm .
  *** [15] cyclin E CDK2 complex is inactivated by CipKip.
  rl (cycE : cdk2) CIPKIP => (cycE : cdk2) : CIPKIP .
  *** [16] CyclinD dependent kinase complex sequentially phosphorylates pRb
  rl ((cycD : cdk4) : CIPKIP) (pRb : E2F1) => ((cycD : cdk4) : CIPKIP) [pRb | phos(D)] E2F1 .
  *** [17] E2F1 activates transcription of genes for Cyclin E
  rl E2F1 => E2F1 cycE .
  *** [18] E2F1 throws cell into S phase
  rl E2F1 => E2F1 entersphase! .
  *** [19] Cyclin E and CDK2 form a complex
  rl cycE cdk2 => cycE : cdk2 .
endm

[0158] Rules (indicated by “rl”) were created to describe interactions among cell cycle regulatory proteins as shown above. Constants were introduced for the representation of cycD 310, cycE 320, cdk4 312, cdk2 322, pRB 330, E2F1 360, and DP1 362. Rules were used to indicate the specificity of cyclins for particular cyclin-dependent kinases: cycD 310 binding to cdk4 312, and cycE 320 binding to cdk2 322.

[0159] Cyclin D•cdk4 314 phosphorylates pRB 330 at multiple amino acid positions. Site D 332 was used to represent this group of amino acids. A rewriting rule ([3]) was formulated to indicate that if pRB and cyclin D•cdk4 are present, then site D of pRB is modified by phosphorylation. A similar rewriting rule ([4]) was generated for site E 334 phosphorylation by cycE•cdk2. This rule requires that site D is phosphorylated in order for site E to be phosphorylated.

[0160] A conditional rule (indicated by “crl”) 336 was formulated to specify that if pRB is phosphorylated at both site D and site E, it cannot bind to the E2F1•DP1 complex.
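
The behavior of rules [1] through [6] can be explored with a Python sketch. This is an approximation written for illustration, not a translation that Maude would accept, and it rests on several assumptions: each modification state of pRB is encoded as a separate symbol, complexes are encoded as single symbol names, and the conditional rule [6] is expanded into one unconditional rule for each form of pRB that is not fully phosphorylated. The search enumerates every state in which no rule applies.

```python
from collections import Counter, deque

# Rules [1]-[6] of module K5, re-encoded with plain symbol names (an assumption).
K5_RULES = [
    (("cycD", "cdk4"), ("cycD:cdk4",)),                           # [1]
    (("cycE", "cdk2"), ("cycE:cdk2",)),                           # [2]
    (("pRb", "cycD:cdk4"), ("pRb-P(D)", "cycD:cdk4")),            # [3]
    (("pRb-P(D)", "cycE:cdk2"), ("pRb-P(D,E)", "cycE:cdk2")),     # [4]
    (("E2F1", "DP1"), ("E2F1:DP1",)),                             # [5]
    (("pRb", "E2F1:DP1"), ("pRb:E2F1:DP1",)),                     # [6], pRb unmodified
    (("pRb-P(D)", "E2F1:DP1"), ("pRb-P(D):E2F1:DP1",)),           # [6], site D only
    # No rule binds pRb-P(D,E): the guard of rule [6] fails for it.
]

def successors(state, rules):
    """Yield each state reachable from `state` by one rule application."""
    bag = Counter(state)
    for lhs, rhs in rules:
        lhs_bag = Counter(lhs)
        if all(bag[s] >= n for s, n in lhs_bag.items()):
            yield tuple(sorted((bag - lhs_bag + Counter(rhs)).elements()))

def resultant_states(initial, rules):
    """Return all reachable states in which no rule applies."""
    start = tuple(sorted(initial))
    seen, queue, terminal = {start}, deque([start]), set()
    while queue:
        state = queue.popleft()
        nexts = list(successors(state, rules))
        if not nexts:
            terminal.add(state)
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return terminal

# The same initial soup as the rew command in Table 3.
states = resultant_states(["cycD", "cycE", "cdk4", "cdk2", "pRb", "E2F1", "DP1"], K5_RULES)
free_e2f = [s for s in states if "E2F1:DP1" in s]
```

Under these assumptions the search finds three alternative resultant states: pRB bound to E2F1:DP1 while unmodified, pRB bound while phosphorylated at site D only, and fully phosphorylated pRB with E2F1:DP1 left free. Which one a single rewriting run reaches depends on the order in which the rules happen to be applied.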

[0161] Rules were also generated to model the G1 to S phase transition. A rewriting rule ([10]) was generated for the binding of the inhibitor CipKip to the cyclin D:CDK4 (cycD:cdk4) complex. This complex is only formed when a MITOGENICSIGNAL is present. When the growth factor TGF-β is present, the cyclin-dependent kinase inhibitors (CKI) Ink4a and Ink4b are produced (Rule [11]). Ink4a and Ink4b were typed as CKIs; thus, rules could be generated for CKIs without having to specify the exact CKI species. Rule [12] modeled the binding of the CKIs (e.g., Ink4a and Ink4b) to the complex of cycD:cdk4:CIPKIP, which results in cycD phosphorylation and then destruction (Rule [13]).

[0162] Rule [16] was formulated to express the phosphorylation of pRB 330 by active cyclin D:cdk4 314 complexes. This phosphorylation disrupts the pRB:E2F1:DP1 complex 372 and allows E2F1:DP1 364 to activate 394 transcription of cyclin E 390 (Rule [17]) and cause cells to enter S phase (Rule [18]). Rule [19] models the formation of cyclin E:cdk2 complexes, and Rule [15] models the inactivation of cyclin E:cdk2 complexes by CIPKIP.

[0163] Models such as the one described here have been used to run simulations of biological systems using symbolic representation. Output included determination of all possible states leading to the G1/S checkpoint in a mammalian cell, and determination of all possible resultant states given a mitogenic signal outside the cell.

Other Embodiments

[0164] Other embodiments are within the scope of the following claims.

[0165] For example, multiple different cells may be modeled, e.g., a cell for each major organ and cell-type in a subject. Such a system can include general rules and symbols for components common to all cells of the subject, as well as rules and symbols particular to a cell-type. Thus, the general response of an organism to disease, genetic mutation, or environmental change can be simulated and studied. Further, the process of development and differentiation can be modeled in the organism, e.g., from an embryo to adult stages.

[0166] In another example, the rules and symbols may be used to model each individual cell of a population. For example, a system for monitoring pancreatic behavior might include rules and symbols for each of 10,000 α islet cells, 5,000 β islet cells, and 1,000 γ islet cells.

What is claimed is:
 1. A method comprising: generating a model of a biological system, the model comprising rules that express a substitution of at least one symbol by at least another symbol, the symbols representing a biological element, and at least some of the rules being expressed in a manner that enables an inference engine to infer alternative results from the system based on an initial hypothetical state.
 2. The method of claim 1 wherein one or more of the rules comprises an operator for expressing a relationship between at least two of the biological elements, the operator conforming to associative and commutative properties.
 3. The method of claim 1 wherein one or more of the rules expresses concurrent state transitions.
 4. The method of claim 1 wherein at least some of the rules are not terminating.
 5. The method of claim 1 wherein at least one of the rules represents a feedback or feedforward interaction between biological elements.
 6. The method of claim 1 wherein one or more of the rules is reflective.
 7. The method of claim 1 wherein one or more of the symbols representing the biological elements is typed.
 8. The method of claim 7 wherein the types of symbols are organized in hierarchical classes.
 9. The method of claim 8 wherein a symbol for one of the hierarchical classes is matched by any symbol that is a member of the hierarchical class.
 10. The method of claim 1 wherein at least some of the rules are conditional.
 11. The method of claim 1, further comprising expressing the rules graphically by representing at least some of the symbols as points and at least some of rules as lines interconnecting points, each interconnected point corresponding to a symbol that is an operand of the rule.
 12. The method of claim 1 wherein one or more of the symbols represents a polypeptide selected from the group consisting of a protein kinase, a transcription factor, a cytokine, and a nucleotide binding protein.
 13. The method of claim 1 wherein one or more of the symbols represents a polypeptide selected from the group consisting of pRB, cyclins, cyclin-dependent kinases, cyclin-dependent kinase inhibitors, p53, E2F, and DP1.
 14. The method of claim 1 wherein one or more of the symbols represents a drug or exogenous agent.
 15. The method of claim 1 wherein one or more of the symbols represents post-translational modification.
 16. The method of claim 1 wherein the model of the biological system includes a first set of symbols representing molecules in a first cell and a second set of symbols representing molecules in a second cell.
 17. The method of claim 16 wherein one or more of the first set of symbols comprises the same symbols of the second set.
 18. An article comprising machine-readable media having encoded thereon a model of a biological system, the model comprising rules that express a substitution of at least one symbol by at least another symbol, the symbols representing a biological element, and at least one of the rules being expressed in a manner that enables an inference engine to infer alternative results from the system based on an initial hypothetical state.
 19. The article of claim 18 wherein one or more of the rules comprises an operator for expressing a relationship between at least two of the biological elements, the operator conforming to associative and commutative properties.
 20. The article of claim 18 wherein one or more of the rules expresses concurrent state transitions.
 21. The article of claim 18 wherein at least some of the rules are not terminating.
 22. The article of claim 18 wherein at least one of the rules represents a feedback or feedforward interaction between biological elements.
 23. The article of claim 18 wherein one or more of the rules is reflective.
 24. The article of claim 18 wherein one or more of the symbols representing the biological elements is typed.
 25. The article of claim 24 wherein the types of symbols are organized in hierarchical classes.
 26. The article of claim 25 wherein a symbol for one of the hierarchical classes is matched by any symbol that is a member of the hierarchical class.
 27. The article of claim 18 wherein at least some of the rules are conditional.
 28. The article of claim 18 wherein one or more of the symbols represents a polypeptide selected from the group consisting of a protein kinase, a transcription factor, a cytokine, and a nucleotide binding protein.
 29. The article of claim 18 wherein one or more of the symbols represents a polypeptide selected from the group consisting of pRB, cyclins, cyclin-dependent kinases, cyclin-dependent kinase inhibitors, p53, E2F, and DP1.
 30. The article of claim 18 wherein one or more of the symbols represents a drug or exogenous agent.
 31. The article of claim 18 wherein one or more of the symbols represents post-translational modification.
 32. The article of claim 18 wherein the model of the biological system includes a first set of symbols representing molecules in a first cell and a second set of symbols representing molecules in a second cell.
 33. The article of claim 32 wherein one or more of the first set of symbols comprises the same symbols of the second set.
 34. A method comprising: receiving a set of symbols in an inference engine, the set representing a hypothetical initial state of a biological system, the symbols representing elements of the biological system; and processing the initial state using rules that express a substitution of at least one of the symbols by at least another symbol representing a biological element to infer alternative resultant states of the system.
 35. The method of claim 34 wherein the set of symbols representing the hypothetical initial state is generated from an expression profile for a biological sample.
 36. The method of claim 34, further comprising: parsing a profile for a biological sample into symbols; and including at least some of the symbols in the set of symbols representing a hypothetical initial state of the biological system.
 37. The method of claim 36 wherein the profile is a gene expression profile.
 38. The method of claim 36 wherein the profile is a polypeptide profile.
 39. The method of claim 36 wherein the biological sample is associated with a disease or disorder.
 40. The method of claim 39 wherein the disease or disorder is selected from the group consisting of cancer, diabetes, infection by a pathogen, inflammation, and a disease of aging.
 41. The method of claim 34 wherein infinite substitution chains are detected.
 42. The method of claim 34 wherein values of one or more of the symbols of the resultant states are displayed graphically as a wiring diagram.
 43. The method of claim 42 wherein the wiring diagram comprises a graph having lines interconnecting points, each line corresponding to a rule such that each interconnected point of the line corresponds to a symbol that is an operand of the rule.
 44. The method of claim 34, further comprising: comparing each of the alternative resultant states to one or more reference states.
 45. The method of claim 44 wherein the one or more reference states comprise a state associated with cell proliferation, cell quiescence, cell apoptosis, and cell differentiation.
 46. The method of claim 44 wherein the alternative resultant states are compared to two or more reference states, each reference state being associated with a diagnosis.
 47. The method of claim 44 wherein the hypothetical initial state represents a sample from a patient.
 48. The method of claim 34 wherein the set of symbols representing hypothetical initial state comprises a symbol representing a genetic alteration.
 49. The method of claim 44 wherein the one or more reference states comprise a state associated with a disease or disorder.
 50. The method of claim 49 wherein the disease is selected from the group consisting of cancer, diabetes, infection by a pathogen, inflammation, and a disease of aging.
 51. The method of claim 34 wherein one or more of the rules comprises an operator for expressing a relationship between at least two of the biological elements, the operator conforming to associative and commutative properties.
 52. The method of claim 34 wherein one or more of the rules expresses concurrent state transitions.
 53. The method of claim 34 wherein at least some of the rules are not terminating.
 54. The method of claim 34 wherein at least one of the rules represents a feedback or feedforward interaction between biological elements.
 55. The method of claim 34 wherein one or more of the rules is reflective.
 56. The method of claim 34 wherein one or more of the symbols representing the biological elements is typed.
 57. The method of claim 56 wherein the types of symbols are organized in hierarchical classes.
 58. The method of claim 57 wherein a symbol for one of the hierarchical classes is matched by any symbol that is a member of the hierarchical class.
 59. The method of claim 34 wherein at least some of the rules are conditional.
 60. The method of claim 34 wherein one or more of the symbols represents a polypeptide selected from the group consisting of a protein kinase, a transcription factor, a cytokine, and a nucleotide binding protein.
 61. The method of claim 34 wherein one or more of the symbols represents a polypeptide selected from the group consisting of pRB, cyclins, cyclin-dependent kinases, cyclin-dependent kinase inhibitors, p53, E2F, and DP1.
 62. The method of claim 34 wherein one or more of the symbols represents a drug or exogenous agent.
 63. The method of claim 34 wherein one or more of the symbols represents post-translational modification.
 64. The method of claim 34 wherein the model of the biological system includes a first set of symbols representing molecules in a first cell and a second set of symbols representing molecules in a second cell.
 65. The method of claim 64 wherein one or more of the first set of symbols comprises the same symbols of the second set.
 66. A method comprising: receiving a set of symbols in an inference engine, the set of symbols representing a hypothetical initial state of a biological system, the symbols representing biological elements of the system; and iteratively substituting at least one of the symbols by at least another symbol representing a biological element using rules that represent interactions between the biological elements until a terminal state is detected or until alternative resultant states are detected.
 67. The method of claim 66, further comprising outputting the terminal state or at least one of the alternative resultant states.
 68. The method of claim 66 wherein the hypothetical initial state represents a biological sample from a patient.
 69. The method of claim 68 wherein the biological sample is associated with a disease or disorder.
 70. The method of claim 69 wherein the disease or disorder is selected from the group consisting of cancer, diabetes, infection by a pathogen, inflammation, and a disease of aging.
 71. The method of claim 66, further comprising parsing a profile for a biological sample into symbols; and including at least some of the symbols in the set of symbols representing a hypothetical initial state of the biological system.
 72. The method of claim 66, further comprising: comparing each of the alternative resultant states to one or more reference states.
 73. The method of claim 72 wherein the one or more reference states comprise a state associated with cell proliferation, cell quiescence, cell apoptosis, and cell differentiation.
 74. The method of claim 72 wherein the alternative resultant states are compared to two or more reference states, each reference state being associated with a diagnosis.
 75. The method of claim 66 wherein the set of symbols representing the hypothetical initial state comprises a symbol representing a genetic alteration.
 76. The method of claim 66 wherein one or more of the symbols representing the biological elements is typed.
 77. A method comprising: receiving into an inference engine a rule set comprising rules that express a substitution of one or more of the symbols representing biological elements by at least another symbol representing a biological element; and determining a property of the rule set.
 78. The method of claim 77 wherein the property comprises an indicator of whether the rule set is terminating.
 79. The method of claim 77 wherein the property comprises an indicator of whether the rule set includes one or more rules expressing a feedback or feedforward interaction.
 80. The method of claim 77 wherein the determining comprises associative-commutative matching.
 81. The method of claim 77 further comprising generating a decision diagram.
 82. A method comprising: receiving into an inference engine (1) at least a first and a second set of symbols wherein the first set of symbols represents a hypothetical first state of a biological system, and the second set of symbols represents a hypothetical second state of the biological system, and the symbols represent biological elements of the biological system, and (2) rules that express a substitution of one or more of the symbols representing biological elements by at least another symbol representing a biological element; and determining if one or more of the rules must be true or false for the first state to reach the second state by processing the first state using the rules.
 83. The method of claim 82 wherein the hypothetical first state represents a hypothetical reference sample, the hypothetical second state represents a sample associated with a disease or disorder, and a rule determined to be true identifies biological elements represented by its operands as drug targets.
 84. The method of claim 82, further comprising identifying a first profile for a first sample associated with the hypothetical first state of the biological system, identifying a second profile for a second sample associated with the hypothetical second state of the biological system; and parsing the first and second profiles to produce the first and second sets of symbols.
 85. The method of claim 84 wherein the first and second samples have one or more genetic alterations with respect to one another.
 86. The method of claim 84 in which the first and second profiles include information about mRNA expression.
 87. The method of claim 84 in which the first and second profiles include information about polypeptide abundance.
 88. The method of claim 84 in which the first and second profiles include information about polypeptide modification.
 89. The method of claim 84 in which the first and second profiles include information about metabolite abundance.
 90. The method of claim 82 in which one or more of the rules expresses concurrent state transitions.
 91. The method of claim 82 in which at least some of the rules are not terminating.
 92. The method of claim 82 in which at least one of the rules represents a feedback or feedforward interaction between biological elements.
 93. The method of claim 82 in which one or more of the symbols representing the biological elements is typed.
 94. The method of claim 93 in which the types of symbols are organized in hierarchical classes.
 95. The method of claim 82 wherein the hypothetical first state represents a hypothetical reference sample, the hypothetical second state represents a sample contacted with a drug or exogenous agent, and a rule determined to be true identifies biological elements represented by its operands as drug targets.
 96. An article comprising machine-readable media having encoded thereon software configured to cause the processor to: receive a set of symbols, the set representing a hypothetical initial state of a biological system, the symbols representing biological elements of the system; and iteratively substitute one or more of the symbols representing biological elements by at least another symbol representing a biological element using rules that represent interactions between the biological elements until a terminal state or until alternative resultant states are detected.
 97. The article of claim 96 wherein one or more of the rules comprises an operator for expressing a relationship between at least two of the biological elements, the operator conforming to associative and commutative properties.
 98. The article of claim 96 wherein one or more of the rules expresses concurrent state transitions.
 99. The article of claim 96 wherein at least some of the rules are not terminating.
 100. The article of claim 96 wherein at least one of the rules represents a feedback or feedforward interaction between biological elements.
 101. The article of claim 96 wherein one or more of the rules is reflective.
 102. The article of claim 96 wherein one or more of the symbols representing the biological elements is typed.
 103. The article of claim 102 wherein the types of symbols are organized in hierarchical classes.
 104. The article of claim 103 wherein a symbol for one of the hierarchical classes is matched by any symbol that is a member of the hierarchical class.
 105. The article of claim 96 wherein the software is further configured to cause the processor to: receive a second set of symbols for a hypothetical second state of the biological system; and compare the second set of symbols to the terminal state or to at least one of the alternative resultant states.
 106. An article comprising machine-readable media having encoded thereon software configured to cause the processor to: receive information for a first state of a biological system; generate symbols representing biological elements of the system; and iteratively substitute one or more of the symbols representing biological elements by at least another symbol representing a biological element using rules that represent interactions between the biological elements until a terminal state or until alternative resultant states are detected.
 107. The article of claim 106 wherein one or more of the symbols representing the biological elements is typed.
 108. The article of claim 106 wherein the information comprises values, each value reflecting the abundance of a biological element in the first state.
 109. The article of claim 108 wherein generating comprises comparing each value to a threshold parameter for the value, and generating a symbol for the biological element whose abundance is reflected by the value if the value exceeds the threshold parameter.
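By way of illustration only, the threshold-based symbol generation of claims 108 and 109 and the iterative substitution of claims 66 and 106 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in this application: the multiset state representation, the rule format, the function names (`symbols_from_profile`, `successors`, `explore`), the three toy rules, and all abundance and threshold values are hypothetical and chosen only to make the example runnable.

```python
from collections import Counter

# Hypothetical toy rules: (name, symbols consumed, symbols produced).
# The element names echo the classes listed in claim 61; the rules are illustrative.
RULES = [
    ("bind", Counter({"cyclin": 1, "CDK": 1}), Counter({"cyclin:CDK": 1})),
    ("inhibit", Counter({"CKI": 1, "CDK": 1}), Counter({"CKI:CDK": 1})),
    ("phosphorylate",
     Counter({"cyclin:CDK": 1, "pRB": 1}),
     Counter({"cyclin:CDK": 1, "pRB-P": 1})),
]


def symbols_from_profile(profile, thresholds):
    """Emit one symbol per element whose abundance exceeds its threshold."""
    return Counter(name for name, value in profile.items()
                   if value > thresholds[name])


def successors(state, rules):
    """Yield each state reachable from `state` by a single rule application."""
    for _name, consumed, produced in rules:
        if all(state[sym] >= n for sym, n in consumed.items()):
            yield state - consumed + produced


def explore(initial, rules, max_states=10_000):
    """Return all terminal states (no rule applies) reachable from `initial`.

    Visited states are remembered, so rule sets with cycles do not loop
    forever; `max_states` is an assumed safety bound for unbounded growth.
    """
    def freeze(state):
        return frozenset(state.items())

    seen = {freeze(initial)}
    frontier = [initial]
    terminals = set()
    while frontier:
        state = frontier.pop()
        next_states = list(successors(state, rules))
        if not next_states:
            terminals.add(freeze(state))
        for nxt in next_states:
            key = freeze(nxt)
            if key not in seen:
                if len(seen) >= max_states:
                    raise RuntimeError("state limit reached")
                seen.add(key)
                frontier.append(nxt)
    return terminals


# Hypothetical abundance profile and per-element thresholds.
profile = {"cyclin": 5.0, "CDK": 3.0, "CKI": 2.5, "pRB": 4.0, "p53": 0.1}
thresholds = {name: 1.0 for name in profile}

initial = symbols_from_profile(profile, thresholds)
terminals = explore(initial, RULES)
for terminal in terminals:
    print(sorted(terminal))
```

In this toy example the CDK symbol can be consumed by either the binding rule or the inhibition rule, so the exploration yields alternative resultant states rather than a single terminal state. Representing a state as a multiset makes rule matching independent of symbol order, which is one simple way to approximate an associative and commutative operator of the kind recited in claim 97.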